Welcome to Scribd! Start your free trial and access books, documents and more.Find out more

Statistics for Business

This page intentionally left blank

Statistics for Business

Derek L Waller

AMSTERDAM • BOSTON • HEIDELBERG • LONDON • NEW YORK • OXFORD PARIS • SAN DIEGO • SAN FRANCISCO • SYDNEY • TOKYO

Butterworth-Heinemann is an imprint of Elsevier

Butterworth-Heinemann is an imprint of Elsevier Linacre House, Jordan Hill, Oxford OX2 8DP, UK 30 Corporate Drive, Suite 400, Burlington, MA 01803, USA First edition 2008 Copyright © 2008, Derek L Waller Published by Elsevier Inc. All rights reserved The right of Derek L Waller to be identified as the author of this work has been asserted in accordance with the Copyright, Designs and Patents Act 1988 No part of this publication may be reproduced, stored in a retrieval system or transmitted in any form or by any means electronic, mechanical, photocopying, recording or otherwise without the prior written permission of the publisher Permissions may be sought directly from Elsevier’s Science & Technology Rights Department in Oxford, UK: phone ( 44) (0) 1865 843830; fax ( 44) (0) 1865 853333; email: permissions@elsevier.com. Alternatively you can submit your request online by visiting the Elsevier web site at http:/ /elsevier.com/locate/permissions, and selecting Obtaining permission to use Elsevier material Notice No responsibility is assumed by the publisher for any injury and/or damage to persons or property as a matter of products liability, negligence or otherwise, or from any use or operation of any methods, products, instructions or ideas contained in the material herein Library of Congress Cataloging-in-Publication Data A catalog record for this book is available from the Library of Congress British Library Cataloguing in Publication Data A catalogue record for this book is available from the British Library ISBN: 978-0-7506-8660-0 For information on all Butterworth-Heinemann publications visit our web site at books.elsevier.com

Typeset by Charon Tec Ltd (A Macmillan Company), Chennai, India Printed and bound in Great Britain 08 09 10 10 9 8 7 6 5 4 3 2 1

This textbook is dedicated to my family, Christine, Delphine, and Guillaume. To the many students who have taken a course in business statistics with me … You might find that your name crops up somewhere in this text!

This page intentionally left blank

Contents

About this book ix Using a Normal Distribution to Approximate a Binomial Distribution Chapter Summary Exercise Problems

1

**Presenting and organizing data
**

Numerical Data Categorical Data Chapter Summary Exercise Problems

1

3 15 23 25

169 172 174

6

**Theory and methods of statistical sampling
**

Statistical Relationships in Sampling for the Mean Sampling for the Means from an Infinite Population Sampling for the Means from a Finite Population Sampling Distribution of the Proportion Sampling Methods Chapter Summary Exercise Problems

185

187 196 199 203 206 211 213

2

**Characterizing and defining data
**

Central Tendency of Data Dispersion of Data Quartiles Percentiles Chapter Summary Exercise Problems

45

47 53 57 60 63 65

3

**Basic probability and counting rules 79
**

Basic Probability Rules System Reliability and Probability Counting Rules Chapter Summary Exercise Problems 81 93 99 103 105

7

**Estimating population characteristics
**

Estimating the Mean Value Estimating the Mean Using the Student-t Distribution Estimating and Auditing Estimating the Proportion Margin of Error and Levels of Confidence Chapter Summary Exercise Problems

229

231 237 243 245 248 251 253

4

**Probability analysis for discrete data 119
**

Distribution for Discrete Random Variables Binomial Distribution Poisson Distribution Chapter Summary Exercise Problems 120 127 130 134 136

5

**Probability analysis in the normal distribution
**

Describing the Normal Distribution Demonstrating That Data Follow a Normal Distribution

8 149

150 161

Hypothesis testing of a single population

263

Concept of Hypothesis Testing 264 Hypothesis Testing for the Mean Value 265 Hypothesis Testing for Proportions 272

viii

Contents The Probability Value in Testing Hypothesis Risks in Hypothesis Testing Chapter Summary Exercise Problems Forecasting Using Non-linear Regression Seasonal Patterns in Forecasting Considerations in Statistical Forecasting Chapter Summary Exercise Problems

274 276 279 281

351 353 360 364 366

9

**Hypothesis testing for different populations
**

Difference Between the Mean of Two Independent Populations Differences of the Means Between Dependent or Paired Populations Difference Between the Proportions of Two Populations with Large Samples Chi-Square Test for Dependency Chapter Summary Exercise Problems

301

302 309

11

**Indexing as a method for data analysis
**

Relative Time-Based Indexes Relative Regional Indexes Weighting the Index Number Chapter Summary Exercise Problems Appendix I: Key Terminology and Formula in Statistics Appendix II: Guide for Using Microsoft Excel 2003 in This Textbook Appendix III: Mathematical Relationships Appendix IV: Answers to End of Chapter Exercises Bibliography Index

383

385 391 392 397 398

311 313 319 321

413

10

**Forecasting and estimating from correlated data
**

A Time Series and Correlation Linear Regression in a Time Series Data Linear Regression and Causal Forecasting Forecasting Using Multiple Regression

333

335 339 345 347

429 437 449 509 511

**About this book
**

This textbook, Statistics for Business, explains clearly in a readable, step-by-step approach the fundamentals of statistical analysis particularly oriented towards business situations. Much of the information can be covered in an intensive semester course or alternatively, some of the material can be eliminated when a programme is on a quarterly basis. The following paragraphs outline the objectives and approach of this book. strategy and importantly evaluate expected financial risk. Market surveys are useful to evaluate the probable success of new products or innovative processes. Operations managers in services and manufacturing, use statistical process control for monitoring and controlling performance. In all companies, historical data are used to develop sales forecasts, budgets, capacity requirements, or personnel needs. In finance, managers analyse company stocks, financial performance, or the economic outlook for investment purposes. For firms like General Electric, Motorola, Caterpillar, Gillette (now a subsidiary of Procter & Gamble), or AXA (Insurance), six-sigma quality, which is founded on statistics, is part of the company management culture!

**The subject of statistics
**

Statistics includes the collecting, organizing, and analysing of data for describing situations and often for the purposes of decision-making. Usually the data collected are quantitative, or numerical, but information can also be categorical or qualitative. However, any qualitative data can subsequently be made quantitative by using a numerically scaled questionnaire where subjective responses correspond to an established number scale. Statistical analysis is fundamental in the business environment as logical decisions are based on quantitative data. Quite simply, if you cannot express what you know, your current situation, or the future outlook, in the form of numbers, you really do not know much about it. And, if you do not know much about it, you cannot manage it. Without numbers, you are just another person with an opinion! This is where statistics plays a role and why it is important to study the subject. For example, by simply displaying statistical data in a visual form you can convince your manager or your client. By using probability analysis you can test your company’s

Chapter organization

There are 11 chapters and each one presents a subject area – organization of information, characteristics of data, probability basics, discrete data, the normal distribution, sampling, estimating, hypothesis testing for single, and multiple populations, forecasting and correlation, and data indexing. Each chapter begins with a box opener illustrating a situation where the particular subject area might be encountered. Following the box opener are the learning objectives, which highlight the principal themes that you will study in the chapter indicating also the subtopics of each theme. These subtopics underscore the elements that you will cover. Finally, at the end of each chapter is a summary organized according to the principal themes. Thus, the box opener, the learning objectives, the chapter itself, and the

or chi-square values as all of these are contained in the Microsoft Excel package. the footers. What I have done with these Excel screen graphs (or screen dumps as they are sometimes disparagingly called) is to tidy them up by removing the tool bar. as underscored by these chapter exercises. in Appendix III is a section that covers all the arithmetical terms and equations that provide all the basics (and more) for statistical analysis. Basic mathematics You may feel a little rusty about your basic mathematics that you did in secondary school. Thus.) . and equations that are highlighted in bold letters throughout the text. histograms and pie charts. you will find reference to all the appropriate statistical functions employed. (Note. Microsoft excel This text is entirely based on Microsoft Excel with its interactive spreadsheets. The emphasis of this textbook. rather than other statistical packages. jargon.x About this book chapter summary are logically and conveniently linked that will facilitate navigation and retention of each chapter subject area. A calculator will round numbers. Associated with this paragraph are several Excel screens giving the stepwise procedure to develop graphs from a particular set of data. This is because all the examples and exercises have been calculated using Excel. is on practical business applications. in the paragraph “Using the Excel Functions”. Student-t. These functions contain all the mathematical and statistical relationships such as the normal. and built-in macro-functions. graphical capabilities. in the text you may find that if you perform the application examples and test exercises on a calculator you may find slightly different answers than those presented in the textbook. and the numerical column and the alphabetic line headings to give an uncluttered graph. A guide of how to use these Excel functions is contained in Appendix II. All these have been developed from data using an Excel spreadsheet and this data has then been converted into the desired graph. Further there are numerous multipart end-of-chapter exercises and a case. In this case. this textbook does not include any of the classic statistical tables such as the standardized normal distribution. These Excel graphs in PowerPoint format are available on the Web. and Student-t distributions. when you have completed this book you will have gained a double competence – understanding business statistics and versatility in using Excel! Glossary Like many business subjects. line graphs. A guide of how to make these Excel graphs is given also in Appendix II in the paragraph. For this reason. The 11 chapters in this book contain numerous tables. over 300. I have chosen Excel as the cornerstone of this book. All of these examples and exercises are based on Microsoft Excel. These definitions and equations. Poisson. Worked examples and end-of-chapter exercises In every chapter there are worked examples to aid comprehension of concepts. are all compiled in an alphabetic glossary in Appendix I. The related Table E-2 then gives a listing and the purpose of all the functions used in this text. The answers for the exercises are given in Appendix IV and the databases for these exercises and the worked examples are contained on the enclosed CD. “Generating Excel Graphs”. As you work through the chapters in this book. which carries up to 14 figures after the decimal point. binomial. as in my experience Excel is a major working tool in business. statistics contains many definitions.

. and more fun. please do not hesitate to get in touch through the Elsevier website or at my e-mail address: derek.waller@wanadoo. and cases from various countries where the $US. If you need any further information. exercises. The author I have been in industry for over 20 years using statistics and then teaching the subject for the last 21 with considerable success using the subject material. I often hear remarks like: “I will never pass this course. The subject is made easier. Euro. This text avoids working through tedious mathematical computations. You should not have any qualms about studying statistics – it really is not a difficult subject to grasp. This textbook recognizes this by using box openers. examples. To aid comprehension. and that can be covered in a reasonable time frame. You will find the book shorter than many of the texts on the market but I have only presented those subject areas that in my experience give a solid foundation of statistical analysis for business. by using Microsoft Excel. I plan to take a job in human resources?” All these remarks are unwarranted and the knowledge of statistics is vital in all areas of business. and Pound Sterling are employed. the textbook begins with fundamental ideas and then moves into more complex areas.” “I don’t need a course in statistics as I am going to be in marketing.” “I am no good at maths and so I am sure I will fail the exam.” “What good is statistics to me.About this book xi International The business environment is global.fr. Learning statistics Often students become afraid when they realize that they have to take a course in statistics as part of their college or university curriculum. and the approach given in this text. or have questions to ask. often found in other statistical texts that I find which confuse students.

This page intentionally left blank .

and Helen three of the regional sales directors. and representatives from production and product development. would you kindly interpret that data”. Susan. the Head of Marketing. “I am sorry Steve but you just have to remember that all of our people are busy and need to be presented information that gives them a clear and concise picture of the situation.Presenting and organizing data 1 How not to present data Steve was an undergraduate business student and currently performing a 6-month internship with Telephone Co. Valerie Jones. “What does all that mean?” “I just don’t understand the significance of those figures?” “Sir. There were 10 people in the meeting including Roger. . After the meeting Valerie took Steve aside and said. At first there was silence and then there were several very pointed comments.1 with the comment that “This is the 200 pieces of raw sales data that I have collected”. Today he was feeling nervous as he was about to present the results of a marketing study that he had performed on the sales of mobile telephones that his firm produced. Steve showed his first slide as illustrated in Table 1. Steve’s manager. The way that you presented the information is not at all what we expect”.

489 106.796 123.498 107.985 86.985 92.975 82.378 109.598 134.785 108.128 77.897 68.895 145.897 134.597 120.583 150.653 99.498 56.875 137.564 79.598 77.654 70.847 124.128 135.985 91.865 103.486 85.654 62.598 80.632 123.589 136.479 73.432 140.597 85.698 78.584 97.985 80.890 72.654 100.562 84.854 76.259 65.984 74.598 133.492 103.987 59.985 123.654 97.894 93.958 89.825 165.479 116.598 128.598 68.987 77.459 138.999 90.128 122.985 132.786 132.326 99.489 82.578 161.958 54.298 112.258 162.298 88.598 153.987 58.564 113.945 84.562 89.695 105.569 104.987 72.859 67.895 81.265 142.295 95.754 101.298 61.569 139.654 124.698 81.598 129.879 126.654 118.975 134.597 99.965 184.456 143.865 163.897 70.592 108.590 89.589 152.298 92.897 73.895 100.859 121.985 127.985 96.865 164.864 160.486 131.215 107.695 89.564 93.695 82.759 58.562 156.259 64.958 71.598 66.128 86.659 125.689 101.456 96.489 87.856 110.958 106.584 133.459 96.785 74.958 63.986 60.984 90.298 111.985 103.895 63.958 102.856 66.498 88.459 136.975 76.592 107.895 102.856 64.312 119.694 106.695 101.189 124.698 45.589 105.957 117.985 122.987 60.598 95.654 69. 170.985 104.987 86.198 110.587 104.832 121.895 52.654 149.987 165.874 .987 87.489 69.1 35.459 78.876 87.598 78.598 47.958 113.569 184.489 Raw sales data ($).592 94.651 98.894 75.2 Statistics for Business Table 1.985 151.985 47.568 103.689 114.982 50.189 101.896 109.289 130.296 71.895 102.298 115.980 120.856 144.295 83.487 81.456 141.895 97.856 83.958 168.458 112.598 120.976 100.957 91.859 146.598 55.157 98.490 138.562 37.598 85.865 152.964 56.

✔ ✔ Numerical data • Types of numerical data • Frequency distribution • Absolute frequency histogram • Relative frequency histogram • Frequency polygon • Ogive • Stem-and-leaf display • Line graph Categorical data • Questionnaires • Pie chart • Vertical histogram • Parallel histogram • Horizontal bar chart • Parallel bar chart • Pareto diagram • Cross-classification or contingency table • Stacked histogram • Pictograms As the box opener illustrates.Chapter 1: Presenting and organizing data 3 Learning objectives After you have studied this chapter you will be able to logically organize and present statistical data in a visual form so that you can convince your audience and objectively get your point across. x and y. is to put it into a frequency distribution. which is collected information that has not been organized.000. You will learn how to develop the following support tools for both numerical and categorical data accordingly as follows.500. The information presented in Table 1. Numerical Data Numerical data provide information in a quantitative form. or how often. to make it easier to understand. in the business environment. A guide is to have at least 5 classes but no more than 15 although it really depends on the amount of data . Management people are busy and often do not have the time to make an in depth analysis of information. Types of numerical data Numerical data are most often either univariate or bivariate. the data are more manageable than raw data and we can demonstrate clearly patterns in the information. Thus a simple and coherent presentation is vital in order to get your message across. All these give information in a numerical form and clearly state a particular condition or situation. Frequency distribution One way of organizing univariate data. A frequency distribution is a table. it is vital to show data in a clear and precise manner so that everyone concerned understands the ideas and arguments being presented. x. For example. When data is collected it might be raw data. My gross salary last year was £70. categories. where the data are arranged into unique groups. The firm’s net income last year was $14. that can be converted into a graph. The next step after you have raw data is to organize this information and present it in a meaningful form. Univariate data are composed of individual values that represent just one random variable. and any data that is subsequently put into graphical form would be bivariate since a value on the x-axis has a corresponding value on the y-axis. Bivariate data involves two variables. Usually the greater the quantity of data then there should be more classes to clearly show the profile. He ran the Santa Monica marathon in 3 hours and 4 minutes. the house has 250 m2 of living space.1 is univariate data. or classes according to the frequency. data values appear in a given class.000 and this year it has increased to £76. This section gives useful ways to present numerical data. By grouping data into classes.400.

00 9.000 to 55.000 145.00 6.000 to 45.000 to 105.000 75.00 Midpoint of class range 30.000 80.000 45. to develop a frequency distribution for these sales data.000 105.00 12.378).000 to 115.00 9.000 150.000 110.00 7.000 to 65. By using these upper and lower boundary limits we have included all of the 200 data items.000 50. In selecting a lower value of $35.000 to 195.000 55.4 Statistics for Business available and what we are trying to demonstrate. The class range or class width is given by the following relationship: Class range or class width Desired range of the complete frequency distribution Number of groups selected r 1(i) maximum value for presenting this data is $185.000 65.00 11.000 to 135. Let us consider the sales data given in Table 1. the class range or width should be the same such that there is coherency in data analysis.00 100.000 90.000 and each class increase by intervals of $10. Class range ($) Amount of data in class 0 2 6 14 18 22 24 30 20 18 14 12 8 6 4 2 0 200 Percentage of data 0.000 100.000 35.000 .000 190.000 130.2 Class no.000 125. If we use the [function MIN] in Excel it gives the lowest value of $35.000 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 Total 25.00 0.000 (the nearest value in ’000s below $35.1.000 170.378.000 135.000 165.00 4.000 40.00 1.000 to 175. 000 15 $10.957 as the highest value of this data.000 to 85.000 160.000 to 165.00 2.000 and an upper Frequency distribution of sales data. In the frequency distribution.00 10. If we want 15 classes then the class range or class width is given as follows using equation 1(i): Class range or class width $185. we obtain $184.000 to the upper limit of $185. The tabulated frequency distribution for the sales data using 15 classes is shown in Table 1.000 175.000 (the nearest value in ’000s above $184. When we develop a frequency distribution we want to be sure that all of the data is contained within the boundaries that we establish.000 to 35. 000 $35. The 1st column gives the number of the class range.000 70.000 155.000 to 155.000 to 145.00 15.000 140.2.00 3.000 to 75.00 7. and the 3rd column gives the amount of data in each range. Thus. the 2nd gives the limits of the class range.000 120.000 115.00 3.000 to 125. 000 The range is the difference between the highest and the lowest value of any set of data.00 1. If we use the [function MAX] in Excel.000 185.000 60.000 to 95.000 85.000 180. The lower limit of the distribution is $35.000 95.957) and a minimum value is $35.000. a logical Table 1.000 to 185.

is obtained by dividing the amount of data in a particular class by the total amount of data.000)/16].2. Note that in this example. The reason for this will be explained in the later section entitled.000. etc. 5 Absolute frequency histogram Once a frequency distribution table has been developed we can convert this into a histogram. defined by the y-axis. as shown in the 4th column of Table 1. Then select [function FREQUENCY] in Excel and enter the dataset.000 in increments of $10. However.000 [(190.375 [(185.000 falls in the class range. (Note that in Table 1. The first bar contains data in the range $35.000 – 35.000 to $45.000.001 is in the class range $45. in the class width $45. whereas $45.000 which brings us back to a class range of $10. the third in the range $55. The vertical bars. using the graphics capabilities in Excel.000 to $55. and the class limits you developed that are demanded by the Excel screen.000.000. $35. Above each bar is indicated the amount of .2.Chapter 1: Presenting and organizing data value of $185. The value of $45.000 [(195. Note that once you have created a frequency table or graph you are now making a presentation in bivariate form as all the x values have a corresponding y value.000)/16].000 which again gives a class range of $10.000 to $65.000. This is a relative frequency distribution meaning that the percentage value is relative to the total amount of data available. In this case the class limits are $35.000. Note in the frequency distribution the cut-off points for the class limits. Figure 1. or to the frequency of the amount of data that occurs in a given class range. In either case we still maintain a closed-limit frequency distribution. “Frequency polygon”. and so the frequency distribution is called a closed-ended frequency distribution as all data is contained within the limits.000 – 35.8 ] simultaneously and this will give a frequency distribution of the amount of the data as shown in the 3rd column of Table 1. That is to say. which is a visual presentation of the information.1 gives an absolute frequency histogram for the sales data using the 3rd column from Table 1.000.2 we have included a line below $35. when we calculated the class range or class width using the maximum and the minimum values for 15 classes we obtained a whole number of $10. or x-axis.000 we have included all the sales data values. or proportion of data. if we wanted 16 classes then the class range would be $9. An absolute frequency histogram is a vertical bar chart drawn on an x. The horizontal.000 of a class range 185.) In order to develop the frequency distribution using Excel. we can keep the minimum value at $35.000 to $55. of exactly the same height and with exactly the corresponding lines as the class limits. Alternatively. You then highlight a virgin column.2.000. is a numerical scale of the desired class width where each class is of equal size.000 – 30. the second bar has data in the range $45.000 of a class range 25.000 to 195. you first make a single column of the class limits either in the same tab as the dataset or if you prefer in a separate tab. that is the information in Table 1.00%.and y-axes.000 to $55. have a length proportional to the actual quantity of data.000 to $185.000 to 35.000 and $45. you press the three keys.000 and 30.1. For example. immediately adjacent to the class limits. Whole numbers such as this make for clear presentations. Here we have 15 vertical bars whose lengths are proportional to the amount of contained data.↑ . When these have been selected. there are six pieces of data and 6/200 is 3. The percentage.000 and make the maximum value $195.000)/16] which is not as convenient.000 and a line above $185. In this case we can modify our maximum and minimum values to say 190. control-shift-enter [Ctrl . the lengths of the vertical bars are dependent on.000. the range selected by our class width. or a function of.

The relative frequency histogram of the sales data is given in Figure 1. 30 to be exact. can be converted into 12 16 5 to . Frequency polygon The absolute frequency histogram.000 the proportion of the total sales data is 15%.2. represented by the y-axis. or the relative frequency histogram.000 to $105. The shape of this histogram is identical to the histogram in Figure 1.000 to $ 105. is the percentage or proportion of the total data rather than the absolute amount.000.6 Statistics for Business Figure 1. Relative frequency histogram Again using the graphics capabilities in Excel we can develop a relative frequency histogram.000. In presenting this information to say.1 Absolute frequency distribution of sales data. 32 30 28 26 Amount of data in this range 24 22 20 18 16 14 12 10 8 6 4 2 0 2 0 45 55 75 85 35 65 95 5 5 5 5 5 5 5 5 5 18 10 11 12 13 14 15 16 17 30 24 22 20 18 14 18 14 12 8 6 6 4 2 0 18 5 to to to to to to to to to to to to 5 17 to 55 35 45 65 75 85 5 5 to 5 5 5 95 10 11 5 13 14 15 $’000s data that is included in each class range. lies in the range $95. We now see that for revenues in the range $95. we can clearly see the pattern of the data and specifically observe that the amount of sales in each class range increases and then decreases beyond $105.1. the sales department.2 where we have used the percent of data from the 4th column of Table 1. which is an alternative to the absolute frequency histogram where now the vertical bar. We can see that the greatest amount of sales of the sample of 200. There is no space shown between each bar since the class ranges move from one limit to another though each limit has a definite cut-off point.

00 45 55 65 75 35 85 95 5 5 5 5 5 5 5 5 5 18 10 11 12 13 14 15 16 17 to to to to to to to to to to to to to 35 45 55 65 75 85 to 95 5 5 5 5 5 5 5 10 11 12 13 14 15 $’000s a line graph or frequency polygon. The frequency polygon is developed by determining the midpoint of the class widths in the respective histogram.000 to $35.00 11 10.00 9 8 7.00 4 3. 000) 2 200.000 and an entry of $185.00 7.00 1 0 0. (maximum value 2 For example. 000 105. These polygons are developed using the graphics capabilities in Excel where the x-axis is the midpoint of the class width and the y-axis is the frequency of occurrence. the midpoint of the class range.00 0.00 Percent of data in this range 12 11. (95. In doing this we are able to construct a frequency polygon. Note that we have given an entry. Note that the relative frequency polygon has an identical form as the absolute frequency polygon of Figure 1.00 2. The midpoint of a class range is.4.3 but the 16 17 5 to 18 5 .000 to $195.00 1.00 6.000 is. $95.2. 000 minimum value) The midpoints of all the class ranges are given in the 5th column of Table 1.00 15.00 7 6 5 4.2 Relative frequency distribution of sales data. which cuts the x-axis for a y-value of zero.00 3 2 1. 16 15 14 13 12.000 where here the amount of data in these class ranges is zero since in these ranges we are beyond the limits of the closed-ended frequency distribution.Chapter 1: Presenting and organizing data 7 Figure 1.00 9. $25.00 3.00 10 9. 000 2 100.3 gives the absolute frequency polygon and the relative frequency polygon is shown in Figure 1.000 to $105. Figure 1.

00 00 0 12 10 15 13 11 14 16 Average between upper and lower values (midpoint of class) Figure 1.4 Relative frequency polygon of sales data. 0 00 00 00 00 00 00 0 00 . 00 0 Average between upper and lower values (midpoint of range) 17 18 . 30 40 50 70 60 80 90 0.0 00 90 . 0. 00 12 0 0. 00 16 0 0. 19 0.0 00 40 .3 Absolute frequency polygon of sales data. 0. 00 19 0 0.0 00 50 . 00 15 0 0. 0. 00 11 0 0. 00 13 0 0.0 .0 00 70 .0 00 80 . 35 30 25 Frequency 20 15 10 5 0 00 00 00 00 00 00 00 0 0 0 0 0 0 0 00 0. 16 15 14 13 12 11 Frequency (%) 10 9 8 7 6 5 4 3 2 1 0 30 . 00 17 0 0. 0.8 Statistics for Business Figure 1. 00 14 0 0. 0.0 .0 . 00 18 0 0.0 .0 .0 00 10 0.0 0.0 .0 00 60 . 0.

00 100. from the less than ogive. in graphical form.00 11.000 125.00 16.00 7.000 45. The difference between presenting the data as a frequency polygon rather than a histogram is that you can see the continuous flow of the data. has a positive slope Table 1.00 3.00 10.2 has been converted into an ogive format and this is given in Table 1. Range of class limits (‘000s) Ogive using absolute data No.00 15.00 9.00 43.00 7.00 57.00 6. we can 9 Ogive An ogive is an adaptation of a frequency distribution. where the y values decrease from left to right.3.00 84. This ogive.Chapter 1: Presenting and organizing data y-axis is a percentage.000 145.000.00 68.00 9.00 42.00 0. or the proportion of. developed from this data.000 165.00 99. Alternatively. n limit age n age class limit but (n 1) limit. rather than an absolute scale.00 96.000 35.00 4.000 195. or cumulated.00 90.000 105.00 2.00 20.00 23. from the greater than ogive we can see that 80.00 97.000 85.000 185. class Number Percentage Percentage Percentage 1) limit.00 1.00 25.00 100.00 3.00 4.00 1. where the data values are progressively totalled. The frequency distribution data from Table 1. which indicates the amount of data below certain limits. n 200 198 192 178 160 138 114 84 64 46 32 20 12 6 2 0 0 2 8 22 40 62 86 116 136 154 168 180 188 194 198 200 0. are given in Figure 1.00 0.00 0. n Ogives of sales data. For example. It has a negative slope.000 95.000 Total 35 35 to 45 45 to 55 55 to 65 65 to 75 75 to 85 85 to 95 95 to 105 105 to 115 115 to 125 125 to 135 135 to 145 145 to 155 155 to 165 165 to 175 175 to 185 185 0 2 6 14 18 22 24 30 20 18 14 12 8 6 4 2 0 200 .00 10.000 135.000 115.00 1.000 55.00 32.00 31.000 75. observations that lie above or below certain limits.5. There is a less than ogive.00% of the sales revenues are at least $75.00 3. The other is a greater than ogive that illustrates data above certain values.00 80. such that the y values increase from left to right.00 0. but n (n Ogive using relative data No. The usefulness of these graphs is that interpretations can be easily made.00 99.00 11. which shows the cumulated data in an absolute form and a relative form.00 6.000 155.3 Class limit.00 12.00 94.000 65. such that the resulting table indicates how many.00 89.00 100. The relative frequency ogives.00 77.000 175.00 58.00 69.00 1.

17 18 5.0 .0 . 5. First the 14 16 . 100 90 80 70 Percentage (%) 60 50 40 30 20 10 0 00 00 00 00 00 00 00 0 0 0 0 0 0 0 0 00 5. 5. A stem-and-leaf display shows all individual data entries whereas a frequency distribution groups data into class ranges. for example.0 . we would need to know to what base we are referring. 00 0 10 11 15 12 13 Sales ($) Greater than Less than see that 90. or stems and trailing digits.000. or leaves. This organizes data showing how values are distributed and cluster around the range of observations in the dataset.0 . 5.00% of the sales are no more than $145. The relative frequency ogive is probably more useful than the absolute frequency ogive as proportions or percentages are more meaningful and easily understood than absolute values. 35 45 55 65 75 85 95 5. 5.000. The ogives can also be presented as an absolute frequency ogive by indicating on the y-axis the number of data entries which lie above or below given values. 00 00 00 00 00 00 00 . In this case a sample of 200 pieces of data.10 Statistics for Business Figure 1. Here we see. Stem-and-leaf display Another way of presenting data according to the frequency of occurrence is a stem-and-leaf display.0 5.5 Relative frequency ogives of sales data. In the latter case. The display separates data entries into leading digits. This is shown for the sales data in Figure 1.4. which is the sales receipts.6. 5. that 60 of the 200 data points are sales data that are less than $85.0 . in £’000s for one particular month for 60 branches of a supermarket in the United Kingdom. Let us consider the raw data that is given in Table 1.0 .

6 9.8 10.5 8.5 14.8 14.6 12.4 Raw data of sales revenue from a supermarket (£’000s). or those values to the right of the decimal point.5.8 13. 5. 5.5 7.8 11.2 12.8 16.4 12.4 11.0 10.7 9.8 15.8 12.8 12.9 10.0 9.5 10.6 12.9 13. This gives an ordered dataset as shown in Table 1.0 .2 10.6 11.1 13.0 .0 .0 10. The stem-and-leaf display appears in Figure 1.5 11. or those values to the left of the decimal point.0 .0 16.0 .2 11.9 9. 35 45 55 65 75 85 95 5.5 13.6 14. 5.9 12.2 11.5 11. 15.2 14 13. 5.2 8.0 . Here we see that the lowest values are in the seven thousands while the highest are in the sixteen thousands. 17 18 00 00 00 00 00 00 00 .7 14.0 5.7 14.4 11. 5.1 10. The stem that has a value of 11 indicates the data that occurs most .7.1 14.Chapter 1: Presenting and organizing data 11 Figure 1. 5.5 11.0 16. For the stem and leaf we have selected the thousands as the stem.1 11.1 9.6 Absolute frequency ogives of sales data. 00 0 15 10 11 12 13 Sales ($) Greater than Less than Table 1.2 13.7 15.0 15.9 15.5 12.0 16 8.1 11. 200 180 160 140 Units of data 120 100 80 60 40 20 0 0 00 00 00 00 00 00 00 0 0 0 0 0 0 0 00 5.4 10.2 12.1 12.7 13. and the leaf as the hundreds.6 9.5 data is sorted from lowest to the highest value using the Excel command [SORT] from the menu bar Data.

1 14.1). which are those methods that give a sense or initial feel about data being studied. In the frequency distribution.7 14. The frequency distribution for the same data is shown in Figure 1. The pattern is similar to the stem-and-leaf display but the individual values are not shown.4 9.2 12. Stem 7 8 9 10 11 12 13 14 15 16 Total 1 8 5 0 0 0 0 1 1 2 0 2 8 1 2 0 1 2 2 4 0 3 9 2 4 1 1 5 4 5 1 4 5 Leaf 6 7 No.6 11. For example. and the leaf value is 75.0 9. These differences are simply because this is the way that the Another approach to develop a stem-and-leaf display is not to sort the data but to keep it in its raw form and then to indicate the leaf values in chronological order for each stem. 16. If you have no add-on stem-and-leaf display in Excel (a separate package) then the following is a way to develop the display using the basic Excel program: ● ● 6 5 2 2 6 5 6 6 6 2 4 7 8 8 8 7 5 5 8 9 8 5 5 8 9 5 6 7 7 7 9 9 ● ● Arrange all the raw data in a horizontal line.6 14.1 13.of items 8 9 10 11 1 3 6 8 11 10 7 6 5 3 60 frequency function operates in Microsoft Excel.4 13.9 8.2 13.2 11.2 16. A box and whisker plot discussed in Chapter 2 is also another technique in EDA.5 11. (Use the Excel function SORT in the menu bar Data.1 12.2 8.8 11.5 12.0 13.5 Ordered data of sales revenue from a supermarket (£’000s).6 9.9 14. Note that in the frequency distribution.8 10.7 11.0.0 11.000.8 10. if there is a value 9.0 10.1 15. For example.4 16.0 11.6 15.5 12.0 appears in the class range 10 to 11. those sales from £11.5 12.7 12. Line graph A line graph. 11. A stemand-leaf display is one of the techniques in exploratory data analysis (EDA).8 8. 16.1 10.2 14. Transpose the ordered data into their appropriate stem giving just the leaf value. This has a disadvantage in that you do not see immediately which values are being repeated.8 15. Sort the data in ascending order by line.5 13.8 11.) Select the stem values and place in a column.0 12.5 16.4 11. in the stem that has a value of 16 there are three values (16.6 11.75 then the stem is 9.6 10. in the stem-and-leaf display. 11.9 10.5 10.7 12. or usually referred to just as a graph.0 9. frequently or in this case.8 15.5 9. It illustrates the relationship between the variable .1) as 16.8 9. whereas in the frequency distribution for the class 16 to 17 there is only one value (16.1 Figure 1.9 12.8. 7.7 15. the x-axis has the range greater than the lower thousand value while the stemand-leaf display includes this value.0.2 10.7 Stem-and-leaf display for the sales revenue of a supermarket (£’000s). Alternatively.000 to less than £12.2 12.5 13.5 14.12 Statistics for Business Table 1.1 13. gives bivariate data on the x.0 appears in the stem 11 to less than 12.9 11.and y-axes.0 is not greater than 16.

6 Period 1 2 3 4 5 6 7 8 9 10 11 12 Sales data for the last 12 years. illustrating the increase in sales each year. Year 1992 1993 1994 1995 1996 1997 1998 1999 2000 2001 2002 2003 Sales ($‘000s) 1.389 2. Consider for example. A line graph is not necessarily a straight line but can be curvilinear. Here the slope of the graph.Chapter 1: Presenting and organizing data 13 Figure 1.10 now shows the same information except that the y-axis starts at the value of $1. If time represents part of the data this is always shown in the x-axis.8 Frequency distribution of the sales revenue of a supermarket (£). 11 10 9 Number of values in this range 8 7 6 5 4 3 2 1 0 to 10 9 9 7 6 6 7 4 1 0 10 11 12 13 14 15 16 1 to to to to to to to to to 6 7 8 9 10 11 12 13 14 15 Class limits (£) Table 1.700.9 gives the graph for this sales data where the y-axis begins at zero and the increase on the axis is in increments of $500.6 for the 12-year period from 1992 to 2003.070 on the x-axis and the corresponding value on the y-axis.000 and the 16 to 17 7 8 9 .105 2.810 2. Figure 1.665 2.415 2.940 3. Attention has to be paid to the scales on the axes as the appearance of the graph can change and decision-making can be distorted. the sales revenues given in Table 1. is moderate.775 2.000. Figure 1.500 2.213 2.480 2.000 2.

14 Statistics for Business Figure 1.000 1.500 2. 3.900 2.9 Sales data for the last 12 years for “Company A”.10 Sales data for the last 12 years for “Company B”.700 1992 1993 1994 1995 1996 1997 1998 Year 1999 2000 2001 2002 2003 2004 .500 1.100 1. 3.500 $’000s 2.000 500 0 1992 1993 1994 1995 1996 1997 1998 Year 1999 2000 2001 2002 2003 2004 Figure 1.700 $’000s 2.500 3.000 2.300 2.900 1.100 2.

business unit. the sales data of Table 1. The complete pie represents 100% of the data and the usefulness of the pie chart is that we can see clearly the pattern of the data. the United Kingdom has the greatest contribution to sales revenues. this can be converted into a pie chart according to desired categories. When you develop a pie chart you can Questionnaires Very often we use questionnaires in order to evaluate customers’ perception of service level. “Yes” or “No”.Chapter 1: Presenting and organizing data incremental increase is $200. This gives the impression that the sales growth is very rapid. or subscribers’ opinion of a publication. A presentation of this type can be important to show the strength of the firm. the house is the largest on the street. or a name. as shown in the second line. which is why the two figures are labelled “Company A” and “Company B”. This is obviously subjective information. students’ appreciation of a university course.7. For example. As an illustration. and there is additional information in Chapter 6. sales agent. a category. otherwise. if you have a category called “other” be sure that this proportion is small relative to all the other categories in the pie chart. etc.000 or 2. For example with a university course. My salary increased this year. The first line is the category of the response. Each segment of the pie is proportional to the total amount of data it represents and can be labelled accordingly. as a pie chart. increased. Then. Category Very Poor Satisfactory Good Very poor good Score 1 2 3 4 5 . and Austria the least. For example. We can give the categorical response a score. to a survey are also categorical data.8 together with the percentage amount of data for each country. The responses.9. which is then organized and given a label. He ran the Santa Monica marathon in a fast time. a firm’s sales revenues. because we want to know if we are “doing it right” and if not what changes should we make. Line graphs are treated further in Chapter 10. The analysis of this type of questionnaire is illustrated in Chapter 2. Pie chart If we have numerical data.5 times smaller than in Figure 1. They are of course the same company. product type.11. Here the categories are large. or a quantitative value for the subjective response. may be presented according to geographic region.1 has now been organized by country and this tabular information is given in Table 1. We do this Table 1. Here for example. we can analyse this data in order to obtain a reasonable opinion of say the university course. A pie chart is a circle representing the data and divided into segments like portions of a pie. A questionnaire may take the form as given in Table 1. Student A may have a very different opinion of the same programme as Student B. which are quantitative data.7 A scaled questionnaire. your audience will question what is included in this mysterious “other” slice. and fast. We can clearly see now what the data represents and the contribution from each geographical territory. is shown in Figure 1. if the number of responses is sufficiently large. Alternatively categorical data may be developed from numerical data. When you develop a pie chart for data. 15 Categorical Data Information that includes a qualitative response is categorical data and for this information there may be no quantitative data. This information.

or row. Group Country 1 2 3 4 5 6 7 8 9 10 Total Austria Belgium Finland France Germany Italy Netherlands Portugal Sweden United Kingdom Sales Percentage revenues ($) 522. and the x-axis the categories.091.13 gives the relative frequency histogram for this same information where the y-axis is now a percentage scale. p. in these histograms. a parallel or side-by-side bar chart can be developed.470.66 18.15 gives a bar chart for the sales data.533.639 2. as one category does not directly flow to another. Figure 1.266.66% Netherlands 5.1 Vertical histogram An alternative to a pie chart is to illustrate the data by a vertical histogram where the vertical bars on the y-axis show the percentage of data.32% Italy 10.54 6. Figure 1.431 2. One column.01 10.432.16 shows a 1 Economic and financial indicators. or row.829 1.54% Belgium 6.234 20. 98. or two rows of data.086.and y-axes are reversed such that the data are presented in a horizontal.92% Germany 14. Figure 1. and the adjacent column.61 12.01% Portugal 5.333 2.11 Pie chart for sales. From this graph we can compare the change from one period to another.00 Figure 1.12 gives an absolute histogram of the above pie chart sales information where the vertical bars show the absolute total sales and the x-axis has now been given a category according to geographic region.161. rather than a vertical format. is the category. as is the case of a histogram of a complete numerically based frequency distribution. Gantt (1861–1919). Parallel histogram A parallel or side-by-side histogram is useful to compare categorical data often of different time periods as illustrated in Figure 1.054 741. Note.876. .03% Sweden 18. Figure 1.884.14. The figure shows the unemployment rate by country for two different years. is the numerical data.779 1. Horizontal bar chart A horizontal bar chart is a type of histogram where the x.32 5. The graphics capability in Excel does this automatically. Note that in developing a pie chart in Excel you do not have to determine the percentage amount in the table.59% Finland 3.566 4.17 3.16% only have two columns.479 3.17% UK 21. Parallel bar chart Again like the histogram.61% France 12.8 Raw sales data according to country ($).257 2. 15 February 2003.59 100.16 5.92 21.03 14.16 Statistics for Business Table 1.065 1. Horizontal bar charts are sometimes referred to as Gantt charts after the American engineer Henry L. the bars are separated. Austria 2. The Economist.

000 2. The Pareto diagram is a visual chart used often in quality management and operations auditing as it shows those categorical areas that are the most critical and perhaps should be dealt with first. meaning that all possible problems are included. The x-axis gives the categories and the left-hand y-axis is the percent frequency of occurrence according to each of these categories with the vertical bars indicating their magnitude. that occur in the distribution by truck of a chemical product.000.000. then the line graph increases to 100% as shown.400. This type of presentation is known as a Pareto diagram. according to categories.000 4.600.12 Histogram of sales – absolute revenues.000.000 3. (named after the Italian economist.000 2.400. This illustrates the problems. who is also known for coining the 80/20 rule often used in business).000 2.800.000 4.600.000 1. The line graph that is shown now uses the right-hand y-axis and the same x-axis. If we assume that the categories shown are exhaustive.000 1.000 4.200.000 200. Vilfredo Pareto (1848–1923).400.000 600.Chapter 1: Presenting and organizing data 17 Figure 1. Pareto diagram Another way of presenting data is to combine a line graph with a categorical histogram as shown in Figure 1.14.000 1.600.000 3.000 1.400.000.600.200.800.000 2.17.000 1.000 3.200. Usually the presentation is illustrated so that the bars are in descending order from the most important on the left to the least important on the right so that we have an organized picture of our situation.000 2.000 0 ly ria ce an an nd ga en m an st Ita iu Au Fr rla rtu lg nl m Sw ed U K y d s l Po Revenues ($) Be er Fi G Country side-by-side bar chart for the unemployment data of Figure 1.000 4.800.000 3. This is now the cumulative frequency of occurrence of each category.800.200.000 800.000 3. 4. N et he .000 400.

24 22 20 18 Total revenues (%) 16 14 12 10 8 6 4 2 0 d ce y s l en ria m ly an an nd ga an st Ita iu Au Fr rla rtu lg nl m he Po Sw Be er Fi ed U K U U SA G Country Figure 1.0 10.0 7.0 Percentage rate 8.0 4.0 Au st ri Be a lg iu m C an ad a D en m ar Eu k ro zo ne Fr an ce G er m an lia pa n ai n Au st ra ed en itz er la nd Sw ly Ita nd s er la Ja Sp Sw N et Country End of 2002 End of 2001 N et h K .0 6.0 5.13 Histogram of sales as a percentage.0 1.0 9. 13.0 2.18 Statistics for Business Figure 1.0 11.0 0.0 12.0 3.14 Unemployment rate.

500.000 Sales revenues ($) Figure 1.000 1.000.500.000 4. USA UK Switzerland Sweden Spain Netherlands Japan Country Italy Germany France Euro zone Denmark Canada Belgium Austria Australia 0 1 2 3 4 5 6 7 8 Unemployment rate (%) End of 2002 End of 2001 9 10 11 12 13 .000 2.500.15 Bar chart for sales revenues.000 3.000 3. UK Sweden Portugal Netherlands Country Italy Germany France Finland Belgium Austria 0 500.000.000 1.000.000 2.000.Chapter 1: Presenting and organizing data 19 Figure 1.16 Unemployment rate.

550 people in the United States and their professions according to certain states. for example.) Stacked histogram Once you have developed the cross-classification table you can present this visually by developing a stacked histogram.9 according the state of employment. Portions of the histogram indicate the profession.19 gives a stacked histogram for the same table but now according to profession. Portions of the histogram now give the state of residence. 45 40 35 Frequency of occurrence (%) 30 25 50 20 40 15 10 5 0 ed w ng d ed ge g d le ke lin lo ng st ag an ro be ea ac ro m ch ea o ru w w th er 100 90 80 Cumulative frequency (%) 70 60 30 20 10 0 to ts la st da re s er no tio m le ct ly du tu ru rd rre or m ta s po m he pe co ru ru en m um Sc In D ts lle Te oc Pa Reasons for poor service Individual Cumulative Cross-classification or contingency table A cross-classification or contingency table is a way to present data when there are several variables and we are trying to indicate the relationship between one variable and another. Alternatively. Table 1.18 gives a stacked histogram for the cross-classification in Table 1. As an illustration.9 gives a cross-classification table for a sample of 1. we can say that 24 of the residents of South Dakota are contingent of working for the government. Figure 1. Figure 1. D D el ay D –b ra O D ad s w s n . that 51 of the teachers are contingent of residing in Vermont. Alternatively.17 Pareto analysis for the distribution of chemicals.20 Statistics for Business Figure 1. From this table we can say. (Contingent means that values are dependent or conditioned on something else.

550 California Texas Colorado New York Vermont Michigan South Dakota Utah Nevada North Carolina Total Figure 1.Chapter 1: Presenting and organizing data 21 Table 1.9 State Cross-classification or contingency table for professions in the United States. 240 220 200 180 Number in sample 160 140 120 100 80 60 40 20 0 California Texas Colorado New York Vermont Michigan South Dakota State Agriculture Government Banking Teaching Engineering Utah Nevada North Carolina .18 Stacked histogram by state in the United States. Engineering 20 34 42 43 12 24 34 61 12 6 288 Teaching 19 62 32 40 51 16 35 25 32 62 374 Banking 12 15 23 23 37 15 12 19 18 14 188 Government 23 51 42 35 25 16 24 29 31 41 317 Agriculture 23 65 26 54 46 35 25 61 23 25 383 Total 97 227 165 195 171 106 130 195 116 148 1.

22 Statistics for Business Figure 1.19 Stacked histogram by profession in the United States. Number in sample The value of your money today The value of your money tomorrow .20 A pictogram to illustrate inflation. 450 400 350 300 250 200 150 100 50 0 Engineering Teaching Banking Profession North Carolina Vermont Nevada New York Utah Colorado South Dakota Texas Michigan California Government Agriculture Figure 1.

For example. icon. a tool in EDA.20 gives an example of how inflation might be represented by showing a large sack of money for today. Although we use the term line graph. Chapter Summary Numerical data Numerical data is most often univariate. Univariate data can be converted into a frequency distribution that groups the data into classes according to the frequency of occurrence of values within a given class. the display does not have to be a straight line but it can be curvilinear or simply a line that is not straight! Categorical data Categorical data is information that includes qualitative or non-quantitative groupings. A frequency distribution can be simply in tabular form. histogram. In statistical analysis a common tool using categorical responses is the questionnaire. The polygon. An extension of the frequency distribution is the less than. or alternatively. or Newsweek make heavy use of pictograms. where respondents are asked . This chapter has presented several tools useful for presenting data in a concise manner with the objective of clearly getting your message across to an audience. either in absolute or relative form. A histogram can be converted into a frequency polygon which links the midrange of each of the classes. or profits. and leaves. or data with a single variable. Magazines such as Business Week. The commonly used line graph is a graphical presentation of bivariate data correlating the x variable with its y variable. qualitative. The advantage of a graphical display is that you see clearly the quantity. or bivariate which is information that has two related variables. and capital expenditures.Chapter 1: Presenting and organizing data 23 Pictograms A pictogram is a picture. or proportion of information. has our money been reduced by a factor of 50%. 100%. operating costs. Numerical data can be represented in a categorical form where parts of the numerical values are put into a category such as product type or geographic location. The usefulness of ogive presentations is that it is visually apparent the amount. A stem-and-leaf display. or greater than ogive. that is more or less than certain values and may be indicators of performance. Time. or worst. Pictograms are not covered further in this textbook. Figure 1. or comparative manner. or leading values. For example in the figure given. gives the pattern of the data in a continuous form showing where major frequencies occur. The chapter is divided into discussing numerical and categorical data. or 200%? We cannot say clearly. costs. that appear in defined classes. is a frequency distribution where all data values are displayed according to stems. Pictograph is another term often employed for pictogram. revenues. a coin might be shown divided into sections indicating that portion of sales revenues that go to taxes. Attention must be made when using pictograms as they can easily distort the real facts of the data. or trailing values of the data. or sketch that represents quantitative data but in a categorical. This can illustrate key information such as the level of your best. it can be presented graphically as an absolute. profits. or relative frequency. or percentage. and a smaller sack for tomorrow.

In this way a comparison of changes can be made within named categories. and the x-axis the amount or proportion of data within that label. The pie chart is a circle where portions of the “pie” are usually named categories and a percentage of the complete data. No further discussion of pictograms is given in this textbook.24 Statistics for Business opinions about a subject. which are pictorial representations of information. Whether to use a vertical histogram or a horizontal bar chart is really a matter of personal preference. The cross-classification table can be converted into a stacked histogram according the desired categories. These are often used in newspapers and magazines to represent situations but they are difficult to rigorously analyse and can lead to misrepresentation of information. When data falls into several categories the information can be represented in a crossclassification or contingency table. A visual tool often used in auditing or quality control is the Pareto diagram. this chapter mentions pictograms. A pie chart is a common visual representation of categorical data. A vertical histogram can also be used to illustrate categorical data where the x scale has a name. Finally. or label. a questionnaire can be easily analysed. . The whole circle is 100% of the data. This table indicates the amount of data within defined categories. and the y-axis is the amount or proportion of the data within that label. which is a useful graphical presentation of the various classifications. This is a combination of vertical bars showing the frequency of occurrence of data according to given categories and a line graph indicating the accumulation of the data to 100%. The vertical histogram can be shown as a horizontal bar chart where it is now the y-axis that has the name. If we give the categorical response a numerical score. or label. Similarly the horizontal bar chart can be shown as a parallel bar chart where now each label contains data say for two or more periods. The vertical histogram can also be shown as a parallel or side-by-side histogram where now each label contains data say for two or more periods.

4 5. The average . 3. 8.6 9.3 10.6 11.5 Required 1.1 9.7 9. 5.3 12. After examining the data presented in the figure from Question No.7 11.5 10. determine from the appropriate ogive how many of the Hardway stores Carrefour would purchase? 2. 6. Carrefour management decides that it will purchase only those stores showing profits greater than £12.5 10.2 8.5 9.1 6. The following are the revenues. Buyout – Part I Situation Carrefour.9 6.3 13.5 10.4 9. for one particular month. Compare this to the absolute frequency histogram.9 9. Illustrate this information as a closed-ended absolute frequency histogram using class ranges of £1.1 12.5 14.8 8. 1.8 10.2 10.5 7.9 9.Chapter 1: Presenting and organizing data 25 EXERCISE PROBLEMS 1.1 11. Closure Situation A major United States consulting company has 60 offices worldwide.6 11.000 and logical minimum and maximum values for the data rounded to the nearest thousand pounds.7 8. are as follows. Convert the absolute frequency histogram developed in Question 1 into a relative frequency histogram. 2.3 11.6 8.5 11. France.5 12.500.1 12.6 10. in million dollars.3 9. a grocery chain in the Greater London area of the United Kingdom.8 11. On this basis.7 10.5 7.3 12. in £’000s. 4.7 11.7 10.7 13. Develop a stem-and-leaf display for the data using the thousands for the stem and the hundreds for the leaf. Illustrate this data as a greater than and a less than ogive using both absolute and relative frequency values.6 10.5 7. is considering purchasing the total 50 retail stores belonging to Hardway.1 11.3 10.2 15.8 7. for each of the offices for the last fiscal year. The profits from these 50 stores. Convert the relative frequency histogram developed in Question 2 into a relative frequency polygon.7 12.

4. estimate the new average margin per office. From the distribution you have developed in Question 3.120 37.258 54. Present on the appropriate frequency distribution (ogive).120 41.257 28.050 39.250 38.965 46.990 35. 2.730 34. What is the average margin per office for the consulting firm before any closure? 3.690 50.850 41.920 38.410 38.150 33.250 47.532 37.440 32. management is considering closing those offices whose annual revenues are less than the average operating cost.352 38. the number of offices having less than certain revenues. .260 27. ● Use a range of $5 million.26 Statistics for Business annual operating cost per office for these.890 31.010 39. including salaries and all operating expenses.653 24. so they can understand the impact of their proposed decision.750 69.790 41.235 59.110 46.258 34.070 42. ● Maximum on the revenue distribution is rounded to the closest multiple of $10 million.590 46. Required In order to present the data to management. is $36 million.250 34. how many offices have revenues lower than $36 million and thus risk being closed? 5.590 60.080 20.210 20.740 59.850 53. To construct the distribution use the following criterion: ● Minimum on the revenue distribution is rounded to the closet multiple of $10 million.410 41.653 35.820 36.070 24.190 39.660 54.110 50.365 46. develop the following information.520 42.658 31. Present the revenue data as a closed-end absolute frequency distribution using logical lower and upper limits rounded to the nearest multiple $10 million and a class limit range of $5 million.543 58.030 35.584 62. If management makes the decision to close that number of offices determined in Question 3 above.210 33.040 42.650 42.125 25. 1.070 46.324 29.220 42. 49.564 As a result of intense competition from other consulting firms and declining markets.550 27.250 29.920 37.690 43.

2.019 630 692 609 798 823 650 776 712 651 952 729 825 791 830 878 507 769 780 871 732 539 565 926 843 795 794 778 763 773 743 759 968 658 869 821 940 903 993 761 764 919 861 580 620 796 560 709 826 790 847 763 779 682 610 669 852 825 751 1. 5. Develop an absolute value closed-limit frequency distribution table using a data range of 50 attendances and. 3. What is the proportion of the attendance at the swimming pool that is 750 and 800 people? 6. The community is considering building a restaurant facility in the swimming pool area but before a final decision is made. 7. Convert the relative frequency distribution into a greater than and less than ogive and plot these two line graphs on the same axis. Convert this data into an absolute value histogram.004 915 744 724 811 895 621 709 743 808 810 728 792 883 680 880 748 806 619 Required 1.Chapter 1: Presenting and organizing data 27 3. (a) If the probability of more than 900 people coming to the swimming pool was at least 10% or the probability of less than 600 people coming to the swimming . Swimming pool Situation A local community has a heated swimming pool. The community leaders came up with the following three alternatives regarding providing the capital investment for the restaurant. which is open to the public each year from May 17 until September 13. Convert the absolute frequency histogram into a relative frequency histogram. to the nearest hundred. Plot the relative frequency distribution histogram as a polygon. Develop a stem-and-leaf display for the data using the hundreds for the stem and the tens for the leaves. it wants to have assurance that the receipts from the attendance at the swimming pool will help finance the construction and operation of the restaurant. What are your observations about this polygon? 4. a logical lower and upper limit for the data. 869 678 835 845 791 870 848 699 930 669 822 609 755 1. Respond to these using the ogive data. In order to give some justification to its decision the community noted the attendance each day for one particular year and this information is given below.088 750 931 901 726 678 672 582 716 749 685 790 785 835 869 837 745 690 829 748 980 860 707 907 830 956 878 755 874 1.

67 32.70 25.33 26. Approximately what proportion of the time will the canal authorities be collecting at least €105 from boats passing through the canal? 5. Purchasing expenditures Situation The complete daily purchasing expenditures for a large resort hotel for the last 200 days in Euros are given in the table below.80 22. 4.50 32.70 25.00 29.80 22. Rhine river Situation On a certain lock gate on the Rhine river there is a toll charge for all boats over 20 m in length.00 29.20 16. water for the .00 Required 1.10 18.33 19. In a certain period the following were the lengths of boats passing through the lock gate.00 26.00 32.70 32.00 17.25 26.00 21. Under these criteria would the community fund the restaurant? Quantify your answer both in terms of the 10% limits and the attendance values.50 23. beverages. and nonfood items for the five restaurants in the complex.28 Statistics for Business pool was not less than 10%. From the appropriate ogive approximately what proportion of the boats will not have to pay any toll fee? 4. The purchases include all food.00 23. Draw the ogives for this data using a logical maximum and minimum value for the limits to the nearest even number of metres.00 28.67 32. Show this information in a stem-and-leaf display. 3.80 20. It also includes energy.50 25.00 27.80 18. (b) If the probability of more than 900 people coming to the swimming pool was at least 10% and the probability of less than 600 people coming to the swimming pool was not less than 10%.00/m for every metre above the minimum value of 20 m.80 18.00 28.50 19. 22.00 17.00 20.70 32.70 18.33 30.20 25.50 37.50 21.50 36. (c) If the probability of between 600 and 900 people coming to the swimming pool was at least 80%.00 23.90 25.80 20. Quantify your answer.00 32.20 25. 2. Under these criteria would the community fund the restaurant? Quantify your answer both in terms of the 10% limits and the attendance values.50 23.00 24.33 30.00 20.70 18.00 23.70 24.00 26.33 19.00 31. The charge is €15.00 17.

From these ogives.860 274.741 119.525 224.787 179.173 188.680 197.411 94.956 198.000 to €200. 3. This is a line graph connecting the midpoints of each class in the dataset. gasoline for the courtesy vehicles.377 106.076 120.157 295.230 139.346 263.577 157.573 86.000? 4.622 187. gardening and landscaping services.230 221.276 144.137 137.257 188.Chapter 1: Presenting and organizing data 29 three swimming pools.775 179.177 112.173 230.340 224. Develop an absolute frequency polygon of the data.777 213.523 212.230 156.977 157.211 175.135 102.696 102.421 259.186 225.124 97. 7.811 185.862 132.973 144.355 288.746 219.577 269.755 242. to give the lower limit. rounded down to the nearest €10.466 118.651 182. Develop a relative frequency “more than” and “less than” ogive from the dataset.777 108. This histogram will be a closed-limit absolute frequency distribution.256 141.075 237.651 161.880 125.283 177.682 249.375 108.211 262.262 210.213 134.275 153.773 156.075 154.409 136. What is the percentage of purchasing expenditures in the range €180.577 224.880 307.251 175.076 275.568 90.015 205. and the minimum value.157 180.777 180.536 166.860 190.624 188.864 293.076 124.377 130.141 198. Develop an absolute frequency histogram for this data using the maximum value.212 161.076 99.978 253. From these ogives.336 203.215 238.221 254.466 194.157 274.802 130. From the absolute frequency information develop a relative frequency distribution of sales.676 231.415 132.876 217. 70% of the purchasing expenditures are greater than what amount? .024 332.157 187. 6.000.860 246.777 125.826 173.773 Required 1.651 148.977 298.496 194.424 251.240 182.226 246.571 163.173 223.211 181.531 171.898 218.077 257.998 163.251 241.215 168.496 159.651 190.377 126.624 203.612 68.886 187.621 173.977 139. rounded up to the nearest €10.171 134.876 140.923 165. Use an interval or class width of €20.412 282.173 238.476 242.626 141. 2.536 110.138 222.676 141.540 182.373 167.973 173.373 220.415 142.124 204.833 223.211 114.524 192.336 159.524 303.285 297.000. Develop an absolute frequency “more than” and “less than” ogive from the dataset.741 150.101 152.462 161.875 179.609 168.840 206.615 124.162 215.564 217.430 244.173 273. What is the quantity of data in the highest frequency? 5.777 106.187 194.249 270. 63.157 295. what is an estimate of the percentage of purchasing expenditures less than €250.973 165.766 106.415 242.155 137.876 154.476 241.480 192.411 185.880 148. to give the upper limit of the data.266 195.240 291.141 260.256 81.475 217.124 185.466 116. which is a purchased service.426 249.724 113. laundry.866 170.276 86.880 157.011 146.476 172.372 177.415 127.320 235.276 233.146 122.849 191.377 155.741 115.336 201.382 228.533 128.613 195.212 152.124 128.000? 8.856 147.155 147.000.936 208.937 332.957 183.677 146.731 151.573 187.324 161.613 197.377 175.175 248.

European sales Situation The table below gives the monthly profits in Euros for restaurants of a certain chain in Europe.28 0. divide the data for Japan by 100 and those for Denmark and Sweden by 10.00 6. .89 1.25 1.71 0.77 104. (Note in order to obtain a graph which is more equitable.657 429. Exchange rates Situation The table on next page gives the exchange rates in currency units per $US for two periods in 2004 and 2005.86 119.598 325.987 102.37 0. The Economist.39 0.58 1.19 6.481 136. p.857 256. 19 November 2005.659 225.) 2.19 5.789 1. What are your conclusions from this bar chart? 7. 101. Construct a parallel bar chart for this data.796 2 Economic and financial indicators.00 8.33 16 November 2004 1.54 1.2 16 November 2005 Australia Britain Canada Denmark Euro area Japan Sweden Switzerland 1.697 123. Country Denmark England Germany Ireland Netherlands Norway Poland Portugal Czech Republic Spain Profits ($) 985.17 Required 1.274.654 995.30 Statistics for Business 6.

Which are the three countries that have the lowest contribution to profits and what is their total contribution? 8. 2. .3 Country Argentina Armenia Belgium Brazil Britain Bulgaria Canada China Czech Republic Finland France Germany Hungary India Iran Japan Lithuania Mexico Netherlands North Korea Pakistan Romania Russia Slovakia Slovenia South Africa No. Develop a bar chart for this information in terms of absolute profits and percentage profits. Nuclear power Situation The table below gives the nuclear reactors in use or in construction according to country. 4. Develop a histogram for this information in terms of absolute profits and percentage profits. of nuclear reactors 3 1 7 2 27 4 16 11 6 4 59 18 4 22 2 56 2 2 1 1 2 2 33 8 1 2 Region South America Eastern Europe Western Europe South America Western Europe Eastern Europe North America Far East Eastern Europe Western Europe Western Europe Western Europe Eastern Europe ME and South Asia ME and South Asia Far East Eastern Europe North America Western Europe Far East ME and South Asia Eastern Europe Eastern Europe Eastern Europe Eastern Europe Africa (Continued) 3 International Herald Tribune. Develop a pie chart for this information. 18 October 2004.Chapter 1: Presenting and organizing data 31 Required 1. What are the three best performing countries and what is their total contribution to the total profits given? 5. 3.

Textbook sales Situation The sales of an author’s textbook in one particular year were according to the following table. Country Australia Austria Belgium Botswana Canada China Denmark Egypt Eire England Finland France Germany Greece Hong Kong India Iran Israel Italy Sales (units) 660 4 61 3 147 5 189 10 25 1.632 11 523 28 5 2 17 17 4 26 Country Mexico Northern Ireland Netherlands New Zealand Nigeria Norway Pakistan Poland Romania South Africa South Korea Saudi Arabia Scotland Serbia Singapore Slovenia Spain Sri Lanka Sweden Sales (units) 10 69 43 28 3 78 10 4 3 62 1 1 10 1 362 4 16 2 162 . Develop a bar chart for this information by country sorted by the number of reactors. Develop a pie chart for this information according to country for the region that has the highest proportion of nuclear reactors. 3. 2. Which three countries have the highest number of nuclear reactors? 5. Develop a pie chart for this information according to the region. of nuclear reactors 20 9 11 5 17 104 Region Far East Western Europe Western Europe Western Europe Eastern Europe North America Required 1. 4. No. Which region has the highest proportion of nuclear reactors and dominated by which country? 9.32 Statistics for Business Country South Korea Spain Sweden Switzerland Ukraine United States ME: Middle East.

sorting the data from the country in which the units sold were the highest to the lowest. Develop a histogram for absolute book sales by countries in the European Union from the highest to the lowest. converted to $US. What are your comments about this data? 10. and the like. for persons working in textile manufacturing. 1. Develop a histogram for this data by country and by units sold. 27 September 2005. What is your criticism of this visual presentation? 2. Develop a pie chart for book sales by countries in the European Union. Develop a histogram for absolute book sales by continent from the highest to the lowest.4 Country Bulgaria China (mainland) Egypt France Italy Slovakia Turkey United States 4 Wage rate ($US/hour) 1. Which country has the highest book sales as a proportion of total in Europe? Which country has the lowest sales? 5. p. This includes social charges. . Which continent has the highest percentage of sales? Which continent has the lowest book sales? 3.88 19.27 3. Textile wages Situation The table below gives the wage rates by country.78 Wall street Journal Europe. medical benefits. The wage rate includes all the mandatory charges which have to be paid by the employer for the employees benefit. 4. 6.14 0. Develop a pie chart for book sales by continent. vacation.63 3.Chapter 1: Presenting and organizing data 33 Country Japan Jordan Latvia Lebanon Lithuania Luxemburg Malaysia Sales (units) 21 3 1 123 1 69 2 Country Switzerland Taiwan Thailand UAE Wales Zimbabwe Sales (units) 59 938 2 2 135 3 Required 1.05 15.49 0.82 18.

625 33. builders. applied to work (May 2004–June 2005) 62. 2.480 6. The following table gives the statistics for those new arrivals from Eastern Europe since May 2004. hundreds of East Europeans have moved to Britain to work. 1. Europe’s great migration: Britain absorbing influx from the East. T.34 Statistics for Business Required 1.000 a month.000 Fuller.470 250 Percentage in range 42.290 24. Determine the wage rate of a country relative to the wage rate in China. farmhands.000 4.000 30. 3. Immigration to Britain Situation Nearly a year and a half after the expansion of the European Union. Develop a bar chart for this information. What are your conclusions from this data that you have presented? 11. pp. The immigrants work as bus drivers.0 40.400 9. 21 October 2005. Poles.900 16.5 Nationality of applicant Czech Republic Estonia Hungary Latvia Lithuania Poland Slovakia Slovenia Age range of applicant 18–24 25–34 35–44 45–54 55–64 Employment sector of applicant Administration. and management Agriculture Construction Entertainment and leisure 5 Registered to work 14. waitresses.0 11. 4. . dentists. Lithuanians Latvians and others are arriving at an average rate of 16.0 No.755 131. Plot on a combined histogram and line graph the sorted wage rate of the country as a histogram and a line graph for the data that you have calculated in Question 2.0 6. and sales persons.0 1. as a result of Britain’s decision to allow unlimited access to the citizens of the eight East Europeans that joined the European Union in 2004. business.. Show the information sorted. International Herald Tribune.610 3. 4.

applied to work (May 2004–June 2005) 11. Develop a bar chart for the data in the given alphabetical order. 3. What are your conclusions from the charts that you have developed? 12. Transpose the information from Question 1 into a pie chart. Develop a pie chart for the age range of the applicant and the percentage in this range. Develop a bar chart for the employment sector of the immigrant and those registered for employment in this sector. Develop a pie chart for the data and show on this the country and the percentage of pill consumption based on the information provided.500 Required 1. 25 February 2004. Develop a bar chart of the nationality of the immigrant and the number who have registered to work.500 7. 2. .200 19.000 9. 5. 6 Wall Street Journal Europe.500 9. 2.6 Country Canada France Italy Japan Spain United Kingdom USA Pills consumed per 1.Chapter 1: Presenting and organizing data 35 Employment sector of applicant Food processing Health care Hospitality and catering Manufacturing Retail Transport Others No.000 people 66 78 40 40 64 36 53 Required 1.000 people in certain selected countries.000 10.000 53. 4. Pill popping Situation The table below gives the number of pills taken per 1.

36 Statistics for Business 3. The Economist. Electoral College Situation In the United States for the presidential elections.7 Also included is how the state voted in the 2004 United States presidential elections. Which country consumes the highest percentage of pills and what is this percentage amount to the nearest whole number? 4. 2 November 2004. people vote for a president in their state of residency. A12. The following gives the electoral college votes for each of the 50 states of the United States plus the District of Colombia. . p. p. 23. 6 November 2004.8 State Alabama Alaska Arizona Arkansas California Colorado Connecticut Delaware District of Columbia Florida Georgia Hawaii Idaho Illinois Indiana Iowa Kansas Kentucky Louisiana Maine Maryland Massachusetts Michigan Minnesota Mississippi 7 8 Electoral college votes 9 3 10 6 55 9 7 3 3 27 15 4 4 21 11 7 6 8 9 4 10 12 17 10 6 Voted to Bush Bush Bush Bush Kerry Bush Kerry Kerry Kerry Bush Bush Kerry Bush Kerry Bush Bush Bush Bush Bush Kerry Kerry Kerry Kerry Kerry Bush Wall Street Journal Europe. How would you describe the consumption of pills in France compared to that in the United Kingdom? 13. Each state has a certain number of electoral college votes according to the population of the state and it is the tally of these electoral college votes which determines who will be the next United States president.

Chemical delivery Situation A chemical company is concerned about the quality of its chemical products that are delivered in drums to its clients. Develop a pie chart of the percentage of electoral college votes for each state.Chapter 1: Presenting and organizing data 37 State Missouri Montana Nebraska Nevada New Hampshire New Jersey New Mexico New York North Carolina North Dakota Ohio Oklahoma Oregon Pennsylvania Rhode Island South Carolina South Dakota Tennessee Texas Utah Vermont Virginia Washington West Virginia Wisconsin Wyoming Electoral college votes 11 3 5 5 4 15 5 31 15 3 20 7 7 21 4 8 3 11 34 5 3 13 11 5 10 3 Voted to Bush Bush Bush Bush Kerry Kerry Bush Kerry Bush Bush Bush Bush Kerry Kerry Kerry Bush Bush Bush Bush Bush Kerry Bush Kerry Bush Kerry Bush Required 1. 4. 3. Which state has the highest percentage of electoral votes and what is the percentage of the total electoral college votes? 5. What is the percentage of states including the District of Columbia that voted for Kerry? 14. Over a 6-month period it used a student intern to . 2. Develop a histogram of the percentage of electoral college votes for each state. How were the electoral college votes divided between Bush and Kerry? Show this on a pie chart.

2. Reason Delay – bad weather Documentation wrong Drums damaged Drums incorrectly sealed Drums rusted Incorrect labelling Orders wrong Pallets poorly stacked Schedule change Temperature too low No. The column “reason” in the table is considered exhaustive. Fruit distribution Situation A fruit wholesaler was receiving complaints from retail outlets on the quality of fresh fruit that was delivered. of occurrences in 6-months 70 100 150 3 22 7 11 50 35 18 Required 1. What is the problem that happens most often and what is the percentage of occurrence? This is the problem area that you would probably tackle first. In order to monitor the situation the wholesaler employed a student to rigorously take note of the problem areas and to record the number of times these problems occurred over a 3-month period. Construct a Pareto curve for this information. 3. The column “reasons” in the table is considered exhaustive. Reason Bacteria on some fruit Boxes badly loaded Boxes damaged Client documentation incorrect Fruit not clean Fruit squashed Fruit too ripe No. of occurrences in 3 months 9 62 17 23 25 74 14 . Which are the four problem areas that constitute almost 80% of the quality problems in delivery? 15. The following table gives the recorded information over the 6-month period.38 Statistics for Business measure quantitatively the number of problems that occurred in the delivery process. The following table gives the recorded information over the 3-month period.

International Herald Tribune. at the end of the season. 2005. and Cooper. for most of Europe. Invincible Press. J. C. the 20 clubs that make up the Barclay’s Bank sponsored English Premiership league. These prize amounts are indicated in Table 1 for the 2004–2005 season. Construct a Pareto curve for this information. 11 News of the World Football Annual 2005–2006. If you did not know better. 2.11 9 Aizlewood. The best players command salaries of £100. The Sunday Times. The game results are given in Table 2 and the final league results in Table 3 and from these you can determine the amount that was awarded to each club.000 a week excluding endorsements. football or soccer is their passion. 28 August 2005.. What is the problem that happens most often and what is the percentage of occurrence? Is this the problem area that you would tackle first? 3. but the left back had featured in 42 league cup and play off matches since reluctantly leaving Charlton Athletic the previous September. p.9 For many people in England. you might have suspected the engaging 35-year old was a decade younger. According to the accountants. 10 .6 billion) in the 2003–2004 season. the clubs themselves are awarded price money depending on their position on the league tables at the end of the year. T. 11. the young and the not-so-young. Case: Soccer Situation When an exhausted Chris Powell trudged off the Millennium stadium pitch on the afternoon of 30 May 2005. Business and the Beautiful Game. 19 (Book review on soccer). had total revenues of almost £2 billion ($3. 1–2 October 2005.Chapter 1: Presenting and organizing data 39 Labelling wrong Orders not conforming Route directions poor 11 6 30 Required 1. Theobald. and in fact. faithfully go and see their home team play. Football in England is a huge business. What are the problem areas that cumulatively constitute about 80% of the quality problems in delivery of the fresh fruit? 16.10 In addition.. an imprint of Harper Collins. Every Saturday many people. Kogan Page. Not only had he helped West Ham claw their way back into the Premiership league for the 2005–2006 season. he could not have been forgiven for feeling pleased with himself. p. Powell back at happy valley. It had been a good season since opposition right-wingers had been vanquished and Powell and Mathew Etherington had formed a formidable left-sided partnership. Deloitte and Touche. the most watched and profitable league in Europe.

000 9.000 5.000 1.000 8.000 7.370.000 4.000 8.500.170.320.700.000 2.000 3.120.420.000 950.000 5.000 475.220.000 1.000 2.070.900.020. How could you put this in a visual form to present the information to a broad audience? Table 2 Club Games played 38 38 38 38 38 38 38 38 38 38 38 38 38 38 Home Win Draw Lost 13 8 8 5 9 8 14 6 12 8 12 8 12 9 5 6 6 8 5 4 5 5 2 4 4 6 6 6 1 5 5 6 5 7 0 8 5 7 3 5 1 4 For 54 26 24 24 25 29 35 21 24 29 31 24 31 29 Away 19 17 15 22 18 29 6 19 15 26 15 14 12 19 Win Draw 12 5 4 3 7 4 15 1 6 3 5 5 10 5 3 5 6 7 5 6 3 7 5 4 3 7 5 7 Away Lost For 4 10 10 8 7 9 1 11 8 11 11 7 4 7 33 19 16 11 24 13 37 20 21 23 21 23 27 24 Away 17 35 31 21 26 29 9 43 31 34 26 25 14 27 Arsenal Aston Villa Birmingham City Blackburn Rovers Bolton Wanderers Charlton Athletic Chelsea Crystal Palace Everton Fulham Liverpool Manchester City Manchester United Middlesbrough .850.000 Position 11 12 13 14 15 16 17 18 19 20 Prize money (£) 4.650.000 3.000 6.550.600.000 7.750.40 Statistics for Business Table 1 Position 1 2 3 4 5 6 7 8 9 10 Prize money (£) 9.000 Required These three tables give a lot of information on the premier leaguer football results for the 2004–2005 season.800.000 6.270.

Chapter 1: Presenting and organizing data 41 Club Games played 38 38 38 38 38 38 Home Win Draw Lost 7 7 8 5 9 5 7 5 4 9 5 8 5 7 7 5 5 6 For 25 29 30 30 36 17 Away 25 32 26 30 22 24 Win Draw 4 0 4 1 5 2 7 7 5 5 5 8 Away Lost For 9 12 12 13 9 10 22 13 13 15 11 19 Away 32 45 33 36 19 37 Newcastle United Norwich City Portsmouth Southampton Tottenham WBA .

42 Statistics for Business Table 3 Bolton Wanderers Charlton Athletic Blackburn Rovers Birmingham City Crystal Palace Aston Villa Arsenal Aston Villa Birmingham City Blackburn Rovers Bolton Wanderers Charlton Athletic Chelsea Crystal Palace Everton Fulham Liverpool Manchester City Manchester United Middlesbrough Newcastle United Norwich City Portsmouth Southampton Tottenham WBA – – – 1 – 3 2 – 1 0 – 1 1 – 0 1 – 3 0 – 0 1 – 1 1 0 2 0 – – – – 4 3 1 1 3 – 1 – – – 2 – 0 2 – 2 1 – 2 3 – 0 1 – 0 2 – 0 1 1 2 2 – – – – 1 1 1 0 3 – 0 1 – 2 – – – 3 – 3 1 – 1 3 – 1 1 – 1 2 – 0 1 2 0 3 – – – – 1 3 1 0 3 – 0 1 – 0 2 – 1 – – – 0 – 1 1 – 0 4 – 0 0 – 0 0 0 0 1 – – – – 1 2 0 1 2 – 2 1 – 1 1 – 2 0 – 1 – – – 1 – 2 2 – 2 0 – 1 3 2 1 0 – – – – 2 0 0 1 3 – 0 4 – 1 5 – 2 6 – 3 0 – 1 – – – 4 – 0 0 – 0 0 0 0 1 – – – – 1 2 0 1 2 – 2 0 – 0 0 – 1 0 – 1 0 – 2 0 – 4 – – – 0 – 2 0 1 0 1 – – – – 1 4 1 0 5 – 1 1 – 1 0 – 1 1 – 0 1 – 0 2 2 7 – 0 1 – 3 0 – 1 0 – 0 3 – 2 2 – 0 1 – 0 1 – 3 – 2 2 0 – – – – – 0 1 1 4 – 1 – – – 4 3 3 3 – – – – 0 1 2 1 2 – 0 0 – 1 0 – 1 1 0 1 4 0 – – – – – 4 1 1 5 2 3 – 1 3 – 0 0 – 3 0 1 2 5 1 – – – – – 0 2 3 1 1 2 – 0 2 – 1 2 – 1 1 1 0 1 2 – – – – – 0 1 0 0 0 0 – 0 1 – 0 3 – 0 1 0 3 0 1 – – – – – 1 1 2 0 1 2 – 0 1 – 1 2 – 1 3 1 1 1 2 – – – – – 2 1 2 2 1 0 – 0 1 – 0 3 – 0 1 0 3 0 1 – – – – – 1 1 2 0 1 1 – 3 0 – 1 1 – 1 1 0 1 0 1 – – – – – 3 2 3 2 4 5 – 2 2 – 1 0 – 0 1 3 2 1 2 – – – – – 1 1 2 1 2 0 – 0 1 – 1 1 – 1 2 0 2 5 1 – – – – – 3 1 2 2 0 Everton Chelsea Arsenal .

Chapter 1: Presenting and organizing data 43 Manchester United Newcastle United Manchester City Middlesbrough Southampton Norwich City Portsmouth Tottenham Liverpool Fulham 2 – 0 2 – 0 1 – 2 1 – 3 3 – 1 2 – 1 2 – 1 2 – 0 1 – 3 1 – – – – 0 – 1 1 3 – 1 1 – 1 2 – 0 2 – 2 1 – 0 1 – 2 1 – 0 1 – 0 1 2 – 1 – – – – 0 4 – 0 1 – 1 1 – 2 1 – 0 0 – 0 0 – 1 2 – 2 0 – 0 1 – 2 2 1 2 – – – – – 1 1 1 – 2 – 4 0 – 1 0 – 0 1 – 1 2 – 2 0 – 4 1 – 0 0 – 0 1 1 0 0 – – – – 0 1 1 2 5 – 3 2 – 0 2 – 0 0 – 4 0 – 0 1 – 2 2 – 0 0 – 1 1 0 1 1 – – – – 0 2 1 1 1 – 0 4 – 2 2 – 2 2 – 2 2 – 1 1 – 1 4 – 0 0 – 2 2 1 3 1 – – – – 0 3 1 1 4 – 1 3 – 0 1 – 1 3 – 0 1 – 0 4 – 0 4 – 0 3 – 3 1 6 3 1 – – – – 0 0 0 1 3 – 0 3 – 0 0 – 0 1 – 0 0 – 1 2 – 1 3 – 0 0 – 1 2 3 1 2 – – – – 1 1 1 0 2 – 2 2 – 0 2 – 1 3 – 0 1 – 1 0 – 0 2 – 1 2 – 2 1 1 1 2 – – – – 0 0 0 1 1 – 0 1 – 0 1 – 1 0 – 1 3 – 1 2 – 0 0 – 0 3 – 0 0 2 2 0 – – – – 1 0 2 1 1 – 1 1 – 1 4 – 0 1 – 1 1 – 1 1 – 4 1 – 0 3 – 0 2 1 3 1 – – – – 1 0 0 1 1 – 0 1 – 1 1 – 4 0 4 3 2 1 – – – – – 1 3 3 0 1 2 – 1 2 – 0 1 – 0 1 1 2 1 0 – – – – – 2 2 0 1 5 0 – 0 3 – 2 4 – 3 2 1 0 2 2 – – – – – 3 3 0 1 0 – – – 0 – 2 1 – 3 2 2 1 0 0 – – – – – 0 0 2 1 3 1 – 1 – – – 0 – 0 4 2 2 2 1 – – – – – 4 1 2 0 2 2 – 1 2 – 2 – – – 2 1 1 1 0 – – – – – 1 1 2 0 0 2 – 1 2 – 0 2 – 2 – 1 4 0 0 – – – – – – 1 3 0 0 2 – 1 1 – 1 1 – 1 2 – 2 3 2 – – – – – 2 – 1 1 0 3 – 0 1 – 3 2 – 1 2 4 – 5 0 – – – – – 1 1 – 1 0 0 – 0 1 – 0 0 – 1 0 1 1 – 1 – – – – – 2 0 0 – 1 1 – 1 4 – 0 3 – 1 3 3 2 1 – – – – – – 2 2 2 1 – WBA .

This page intentionally left blank .

67. and that the range of the prices of Big Macs is $3. that half of the Big Macs are less than $2. These are some of the properties of statistical data that are covered in this chapter.51. that the cost of living in Switzerland is the highest. The Economist. Alternatively you would know that worldwide the average price of a Big Mac is $2. 1 “The Economist’s Big Mac index: Fast food and strong currencies”.40 and that half are more than $2. . Their 2005 data is given in Table 2.1 From this information you might conclude that the Euro is overvalued by 17% against the $US. and that it is cheaper to live in Malaysia.1.Characterizing and defining data 2 Fast food and currencies How do you compare the cost of living worldwide? An innovative way is to look at the prices of a McDonald’s Big Mac in various countries as The Economist has being doing since 1986.40. 11 June 2005. These are some of the characteristics of the prices of data for Big Macs.

17 2.60 1.05 2.96 1.58 3.39 3.41 1.30 4.53 2.64 2.55 3.54 2.47 1.1 Country Price of the Big Mac worldwide.27 2.34 1.17 2.48 2.50 2.53 2.44 2. Price ($US) 1.58 1.58 1.46 Statistics for Business Table 2.92 3.76 1.63 2.17 5.10 2.49 4.06 2.48 2.38 Country Mexico New Zealand Peru Philippines Poland Russia Singapore South Africa South Korea Sweden Switzerland Taiwan Thailand Turkey United States Venezuela Price ($US) 2.13 Argentina Australia Brazil Britain Canada Chile China Czech Republic Denmark Egypt Euro zone Hong Kong Hungary Indonesia Japan Malaysia .

you will learn the following characteristics. 29 April 2006. ✔ ✔ ✔ ✔ Central tendency of data • Arithmetic mean • Weighted average • Median value • Mode • Midrange • Geometric mean Dispersion of data • Range • Variance and standard deviation • Expression for the variance • Expression for the standard deviation • Determining the variance and the standard deviation • Deviations about the mean • Coefficient of variation and the standard deviation Quartiles • Boundary limit of quartiles • Properties of quartiles • Box and whisker plot • Drawing the box and whisker plot with Excel Percentiles • Development of percentiles • Division of data It is useful to characterize data as these characteristics or properties of data can be compared or benchmarked with other datasets. N. assume the salaries in Euros of five people working in the same department are as in Table 2. The total of these five values is €172. The two common general data characteristics are central tendency and dispersion. and every dataset has a mean value. They are all illustrated as follows. p.000 staff in 2005 was $520. x. Arithmetic mean The arithmetic mean or most often known as the mean or average value. is the x most common measure of central tendency. Goldman Sachs.Chapter 2: Characterizing and defining data 47 Learning objectives After you have studied this chapter you will be able to determine the properties of statistical data. The central tendency that we are most familiar with is average or mean value but there are others. Table 2. 9. the world’s leading investment bank reports that the average pay-packet of its 24.000/5 gives a mean value of €34. to compare datasets. determined by the sum of the all the values of the observations. the world’s leading investment bank epitomises the modern financial system”. to describe clearly their meaning. Susan 50.000 and 172.2. Specifically. and written by –. (On a grander scale. The Economist. and to apply these properties in decisionmaking.000 Helen 20. 40.000 2 “On top of the world – In its taste for risk.000 . x ∑x N 2(i) Central Tendency of Data The clustering of data around a central or a middle value is referred to as the central tendency. In this way decisions can be made about business situations and certain conclusions drawn.400.000 Robert 27.000 John 35.000 and that included a lot of assistants and secretaries!2) The arithmetic mean is easy to understand.2 Eric Arithmetic mean. It is For example. The mean value in a dataset can be determined by using [function AVERAGE] in Excel. The equation is. divided by the number of elements in the observations.

which translates into saying the programme is between satisfactory and good and closer to being good.000 and €34. For example in Chapter 1 we introduced a questionnaire as a method of evaluating customer satisfaction. The number of values does not necessarily influence the arithmetic mean.000 Helen 20. Is John now at a disadvantage? What is the reason that Susan received the increase? Thus.80. The average is still €34. The weighted average of the student response is given by.000 Arithmetic mean not necessarily affected by the number values. Nothing has happened to John’s situation but his salary is now below average. Note that in these different activities the hourly wage rate is different.000 Brian 34.400. assume that Susan has her salary increased to €75.3 Eric 40.800 as shown in Table 2. using the original data. you need to understand if it includes outliers and the circumstance for which the mean value is being used. Assume that a manufacturing organization uses three types of labour in the manufacture of Product 1 and Product 2 as shown in Table 2. Weighted average ∑ Number of responses * score Total responses From the table we have. (Students are the customers of the professors!) The X in each cell is the response of each student and the total responses for each category are in the last line. or weighting of each value in the overall total. Note in Excel this calculation can be performed by using [function SUMPRODUCT]. Now. John has an annual salary of €35. In the above example. Susan 50. in using average values for analysis. Another use of weighted averages is in product costing.4 is the type of questionnaire used for evaluating customer satisfaction.000 and his salary is currently above the average. Thus to calculate the correct average cost of . suppose now that Brian and Delphine join the department at respective annual salaries of €34. Here the questionnaire has the responses of 15 students regarding satisfaction of Thus using the criterion of the weighted average the central tendency of the evaluation of the university programme is 3.3. In recalculating the mean the average salary of the five increases to €39.400 per year.000 Delphine 34. and assembly before it is completed.80 Weighted average The weighted average is a measure of central tendency and is a mean value that takes into account the importance. forming. Table 2. In the above situation.000 per year. a course programme.000 John 35. In making the finished product the semi-finished components must pass through the activities of drilling. Weighted average 2*1 1*2 1*3 5* 4 15 6*5 3.000 Robert 27.48 Statistics for Business Table 2.5.800 Note that the arithmetic mean can be influenced by extreme values or outliers.

00 .25 Then if we use this hourly labour cost to determine unit product cost we would have. Product A Product B 12.75 14.25 1. $/unit is $10.50 2.00 14.25 $12.50 12.75 This is an incorrect way to determine the unit cost since we must use the contribution of each activity to determine the correct amount. Labour Labour hours/unit hours/unit Product A Product B 2.63/unit $71.5 Weighted average.50 * 1. labour cost.50 * 5.25 14.25 * 2.75 * 3.75 7. $/unit is $10. Median value The median is another measure of central tendency that divides information or data into two equal parts.25 2. Very poor 1 Poor 2 Satisfactory 3 Good 4 X X X X X X X X X X X X X X 2 1 1 5 6 1 Very good 5 X 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 15 Total responses Total responses Table 2.50/hour Labour Hourly operation wage rate Drilling Forming Assembly Total $10.25 3 $12.50 $12.00 1.25 * 1.4 Category Score Student 1 Student 2 Student 3 Student 4 Student 5 Student 6 Student 7 Student 8 Student 9 Student 10 Student 11 Student 12 Student 13 Student 14 Student 15 Weighted average.50 $72.44 12. We come across the median when we talk about the median of a road.75 * 2.50 * 2.00 5.50 $89.50 3.Chapter 2: Characterizing and defining data 49 Table 2. weighted averages are used as follows: Product A. the hourly labour cost would be: 10. labour cost.94 12. This is the white line that divides the road into two parts such that there is the same number of lanes on Product B.88/unit labour per finished unit.50 * 7.75 $14.25 If simply the average hourly wage rate was used.75 $90.

Robert 27. Consider again the salary situation of the five people in John’s department as in Table 2. When we have quantitative data it is the middle value of the data array or the ordered set of data.000 Helen 20. 12 7 6 11 12 Table 2.000 Median value – salaries ordered. which in this case is 11.11. The number of values affects the median.7.000 then the revised information is as in Table 2.000 Susan 75. Susan 50. if Susan’s salary is increased to €75.000 one side than on the other. John still has the median salary and so that on this basis.000 Susan 50. is even.000 Eric 40. John’s salary is at the median value. there was a change. Helen 20. or the median.8. John’s salary is now below the median. when we used the average value as above. Assume Stan joins the department in the example above at the same original salary as Susan. the median is the value of 6/2 and (6 2)/2 or the linear average of the 3rd and 4th value. price of a house in a certain region is $200. When n.6 9 13 Median value – raw data.000 Robert 27.10 Median value – salaries unaffected by extreme values.000 40.8 Eric 40.10. the median is given by.000 Median value – salaries.7 6 7 Median value – ordered data. Again.9 Helen 20. Consider the dataset in Table 2.500.000 Table 2. nothing has changed for John.50 Statistics for Business Table 2. nothing has happened . In ascending order this is as in Table 2. and half below. the number of values. The value of the median is unaffected by extreme values. There is an even number of values in the dataset and now the median is (35. value is the 4th number. The salary values are thus as Table 2. the middle. For example.000 John 35. The median value is of interest as indicates that half of the data lies above the median.000 then this indicates that half of the number of houses is above $200. then the median is (7 1)/2 or the 4th value as in the above example. Ordering this data gives Table 2.000)/2 or €37. if there are seven values in the dataset. 9 11 12 12 13 Table 2.000 Robert 27. However. To determine the median value it must first be rearranged in ascending (or descending) order.000 and the other half is below.000 Eric 40.000 John 35.000 John 35. the median value is the average of the values determined from the following relationship: n 2 and (n 2 2) 2(iii) When there are 6 values in a set of data. Again. n 2 1 2(ii) Table 2. Since there are seven pieces of data.6. the number of values in a data array is odd.9. When n. if the median.000 Thus.

1. However. if we determine the median value of the sales data given in Table 1. the number of values might affect the mode.000 50. we call up [function MEDIAN] and enter the dataset.12 Mode – that value that occurs most frequently. January February March April May June July August September October November December 10 12 11 14 12 14 12 16 9 19 10 13 20.000 to John’s salary but on a comparative basis it appears that he is worse off! The median value in any size dataset can be determined by using the [function MEDIAN] in Excel.13 Mode – might be affected by the number of values. the mode is still 12. The responses were according to Table 2. The mode is unaffected by extreme values. the mode can be used for qualitative as well as for quantitative data.14. Table 2.12 are the monthly sales in $millions for the last year.000 35. in a questionnaire. We do not have to order the data or even to take into account whether there is an even or odd number of values as Excel automatically takes this into consideration.000 27. Table 2. the modal value is now $14 million since it occurs 4 times. The modal value is blue since this response occurred 3 times. people were asked to give their favourite colour. January February March April May June July August September October November December January February March 10 12 11 14 12 14 12 16 9 19 10 13 14 10 14 Mode The mode is another measure of central tendency and is that value that occurs most frequently in a dataset. For example. This type of information is useful say in the textile business when a firm is planning the preparation of new fabric or the automobile industry when the company is planning to put . It is of interest because that value that occurs most frequently is probably a response that deserves further investigation.13 over the last 15 months. The mode is 12 since it occurs 3 times.000 40. Unlike the mean and median. For this dataset the median value is 100.000 50. if the sales in January were $100 million instead of $10 million.11 Median value – number of values affects the median. For example. Thus in forecasting future sales we might conclude that there is a higher probability that sales will be $12 million in any given month. For example. For example. if we use the following sales data in Table 2. For example.296. Helen Robert John Eric Susan Stan Table 2.Chapter 2: Characterizing and defining data 51 Table 2.

000 Midrange.14 Mode can be determined for colours. The average growth rate.16 9 13 Midrange.000 Susan 75. is different for each year.000 and so John’s salary is exactly at the midrange. the midrange is. meaning that there are three.5 (product of growth rates) 2(iv) .52 Statistics for Business Table 2. the midrange can be distorted by extreme values.17 Helen 20.000 Eric 40. 12 7 6 11 12 new cars on the market. Table 2. Examples might be the growth of investments. etc.000)/2 or €47.16. which is accumulated annually. The interest rate. In Table 2. Thus.500 and John’s salary is now below the midrange. Yellow Red Green Green Blue Violet Purple Brown Rose Blue Pink Blue Table 2.000 13 17 19 7 Table 2.000 Robert 27. Helen Robert 27. The dataset in Table 2. or the change of the gross national product. four.000 to give the information in Table 2. The midrange is of interest to know where data sits compared to the midrange.17.000 75.000 in a savings account that is deposited for a period of 5 years.000 John 35.000 Eric 40. The modal value in a dataset for quantitative data can be determined by using the [function MODE] in Excel.15 9 Bi-modal.000 3 13 8 22 4 7 9 20. The midrange is (50. quad-modal. the inflation rate. bi-modal is when there are two values in a dataset that occur most frequently.000 20.000)/2 or 35. Again assume Susan’s salary is increased to €75. For example. Geometric mean The geometric mean is a measure of central tendency used when data is changing over time. 13 2 6 19 2 9.18.18 Midrange – affected by extreme values. When a dataset is bi-modal that indicates that there are two pieces of data that are of particular interest.15 is bi-modal as both the values 9 and 13 occur twice. For example. A dataset might be multi-modal when there are data values that occur equally frequently.000 Table 2.000 Susan 50. Then the midrange is (20. consider the growth of an initial investment of $1. In the salary information of Table 2. is calculated by the relationship: n Midrange The midrange is also a measure of central tendency and is the average of the smallest and largest observation in a dataset. Table 2. Data can be tri-modal.19 gives the interest and the growth of the investment. or more values that occur most frequently. John 35. or geometric mean.

Chapter 2: Characterizing and defining data 53 Table 2. as data that is more dispersed or separated is less reliable for analytical purposes.06935 $1.000 Helen 20. The number of values does not necessarily affect the range. Here the range is the difference of the salaries for Susan and Helen.232. Here the range is €125. or varies from other data values. John 35.20 which is the salary data presented earlier in Table 2.082 1.051 Value year-end $1. We have seen the use of the range in Chapter 1 in the development of frequency distributions.060 * 1.2 7. $1.000 €105.0693 1 0. For example. Another illustration is represented in Table 2. Thus. if we include in the dataset the salary of Francis.8.93% per year (1.20 Eric 40. 000 1.000 €30. .000 Range. the manager of the department.079 1. The geometrical mean can be determined by using [function GEOMEAN] in Excel applied to the growth rates.082 * 1.93%). and deposit amounts are large.94 $1.00 $1.21.075 1. we may be more interested in the variation.000. 398. the value of the $1. since variation can be a measure of inconsistency. In many situations. than in the central value.079 1. The range is affected by extreme values. variation.000. the value of the initial deposit at the end of 5 years would be.19 The same value as calculated in Table 2.075 * 1.000 Robert 27. 396. 5 1.398. For example let us say that we This is less than the amount calculated using the geometric mean.139.082 1. the difference can be significant.000 Interest rate (%) 6.000 we then have Table 2.06905 $1.0693 or 6.060.000 €20.34 $1.5 8. The following are the common measures of the dispersion of data. Range The range is the difference between the maximum and the minimum value in a dataset. or €50.19 Table 2. spread out. The difference here is small but in cases where interest rates are widely fluctuating.051 1.330.075 1. 000 * 1. $1.060 1.000 at the end of 5 years will be.1 Susan 50.000 Dispersion of Data In this case the geometric mean is. If the arithmetic average of the growth rates was used.50 $1.0690 or a growth rate slightly less of 6.9 5.90% per year.051 5 1.0693 This is an average growth rate of 6.01 Dispersion is how much data is separated. or spread.19.060 1. the mean growth rate would be: 1. It is important to know the amount of dispersion. Datasets can have different measures of dispersion or variation but may have the same measure of central tendency.079 * 1.000 €20.0 7.19 Year 1 2 3 4 5 Geometric mean. Using this mean interest rate. who has a salary of €125. Growth factor 1.

overcome the drawback of using the range as a measure of dispersion as in their calculation every value in the dataset is considered. Although both the variance and standard deviation are affected by extreme values. or outlying values. s2.000 50. In this case. –.000 27. or as follows: 2 σx ● 40. the impact is not as great as using the range since an aggregate of all the values in the dataset are considered.21 to give the dataset in Table 2. The larger the range in a dataset. The expression for the sample variance. One of the principal uses of statistics is to take a sample from the population and make estimates of the population parameters based only on the sample measurements.21 values. the negative signs are removed. is analogous to the population variance and is. Eric Susan Julie John Francis Helen Robert Expression for the variance There is a variance and standard deviation both for a population and a sample.22. N. any extreme.000 to s2 ∑ (x x )2 1) Variance and standard deviation The variance and the related measure.000 27.000 35. The population 2 variance. then the greater is the dispersion. By dividing by N gives an average value.000 125. and thus the uncertainty of the information for analytical purposes.000 37. Although we often talk about the range of data. By squaring each of the differences obtained.000 20.000 to the dataset in Table 2. the mean value μx is subtracted. denoted by σx . x. The variance is in . This indicates how far this observation is from the mean.000. Eric Range is affected by extreme Susan John Francis Helen Robert 40. or the range of this observation from the mean. μ. Table 2. Using (n 1) in the denominator reflects the fact that we have used – in the formula and so we have x lost one degree of freedom in our calculation.54 Statistics for Business squared units and measures the dispersion of a dataset around the mean value. standard deviation. For example.000 add the salary of Julie at €37.000 Table 2. We use the term “standard” in standard deviation as it represents the typical deviation for that particular dataset.000 125. is the sum of the squared difference between each observation. 2(vi) (n In the sample variance.000 20.22 Range is not necessarily affected by the number of values. The variance and particularly the standard deviation are the most often used measures of dispersion in statistics. consider you have a sum of $1. and the mean value.000 50.22.000 35. By convention when we use the symbol n it means we have taken a sample of size n from the population of size N. divided by the number of data observations.21 and 2. The standard deviation has the same units of the data under consideration and is the square root of the variation. Then the range is unchanged at €105. or x-bar the average of x the values of x replaces μx of the population variance and (n 1) replaces N the population size. the major drawback in using the range as a measure of dispersion is that it only considers two pieces of information in the dataset. can distort the measure of dispersion as is illustrated by the information in Tables 2. ∑ (x − μx )2 N 2(v) ● ● For each observation of x.

$75. ∑ (x (n x )2 1) 2(viii) For any dataset. σx.25. If we use equations 2(v) through to 2(viii) then we will obtain the population variance. It is the most often used measure of dispersion in analytical work. the logic being that it is understood that the values are calculated using the random variable x and so it is not necessary to show them with the mean and standard deviation symbols! Note that for any given dataset when you calculate the population variance it is always smaller than the sample variance since the denominator. To the sixth co-worker you have no degree of freedom of the amount to give which has to be the amount remaining from the original $1. These values and the calculation steps are shown in Table 2. That is the subscript x is dropped.000.23. $150.24. then the smaller is the dispersion which means that the data values are closer to the mean value of the dataset and that the data would be more reliable for subsequent analytical purposes. Deviation about the mean The deviation about the mean of all observations. x. about the mean value. s. 55 Table 2. N. is given by.24. and the sample standard deviation. in the population variance is greater than the value N 1. The population standard deviation. To the first five you have the freedom to give any amount say $200. then using n or (n 1) will give results that are close. which in this case is $105. is zero or x mathematically. 9 13 Variance and standard 12 7 6 11 12 Determining the variance and the standard deviation Let us consider the dataset given in Table 2. which is a summary of the final results of Table 2. is large. Similarly μ is used to denote the mean value rather than μx. Note that the expression σ is sometimes used to denote the population standard deviation rather than σx. is as follows: s s2 Population variance [function VARP] Population standard deviation [function STDEVP] Sample variance [function VAR] Sample standard deviation [function STDEV]. the closer the value of the standard deviation is to zero. n. σx 2 σx ∑ (x N μx )2 2(vii) ● ● The sample standard deviation. –. the population standard deviation. If the sample size.23 deviation.Chapter 2: Characterizing and defining data distribute to your six co-workers based on certain criteria. $210. Table 2. $260. Similarly for the same dataset the population standard deviation is always less than the calculated sample standard deviation. ∑ (x x) 0 2(ix) . When we are performing sampling experiments to estimate the population parameter with (n 1) in the denominator of the sample variance formula we have an unbiased estimate of the true population variance. illustrates this clearly. the sample variance. However with Excel it is not necessary to go through these calculations as their values can be simply determined by using the following Excel functions: ● ● Expression for the standard deviation The standard deviation is the square root of the variance and thus has the same units as the data used in the measurement.

how small is small and what about the units? If you say that the standard deviation of the total travel time.7080 Measure of dispersion Population variance Sample variance Population standard deviation Sample standard deviation Table 2. σ.5071 6 7.5071 2. if you use seconds. the standard deviation has not changed! A way to overcome the difficulty in interpreting the standard deviation is to include the value of the mean of the dataset and use the coefficient of variation. The standard deviation as a measure of dispersion on its own is not easy to interpret. N Total of values Mean value. including waiting. Variance and standard Coefficient of variation and the standard deviation Value 6.2857 7. But in any event.3333 2. μ.24 Variance and standard deviation. the number 2 is small.26 the mean is 10.3333 2. to fly from London to Vladivostok is 2 hours. s Table 2. to its mean.200. μ Population variance. σ N 1 Sample variance.7080 μ)2 Number of values.2857 2. σ2 Population standard deviation. And the deviation of the data around the mean value of 10 is as follows: (9 10) (13 10) (12 10) (7 (6 10) (11 10) (12 10) 10) 0 This is perhaps a logical conclusion since the mean value is calculated from all the dataset values.25 deviation. The coefficient of variation is a relative measure of the standard deviation of a distribution.56 Statistics for Business Table 2. and a high 7. In general terms a small value for the standard deviation indicates that the dispersion of the data is low and conversely the dispersion is large for a high value of the standard deviation. However.26 Deviations about the mean value. if you convert that to minutes the value is 120. Further. s2 Sample standard deviation. However the magnitude of these values depends on what you are analysing. 9 13 12 7 6 11 12 In the dataset of Table 2. x 2 13 12 7 6 11 12 7 70 10 (x 1 3 2 3 4 1 2 μ) (x 1 9 4 9 16 1 4 44 6. The .

27 gives a summary. Another divider of data is the quartiles or those values that divide ordered data into four equal parts. This value is small and perhaps would be acceptable from a quality control point of view.06 is a small number but it gives a coefficient of variation of 0. in absence of the values for the population. It is defined as follows: Coefficient of variation σ μ 2(x) Operator A Operator B 57 Table 2. or sales’ revenues are positioned relative to standardized data. Table 2. which is 0. The value 0.20 As an illustration. the 2nd quartile Q2. the value for Operator A is 8/45 or 17.25 cm or 0. and the standard deviation of the length of the rods that are cut is 0. However.04 or 4%.06 m.78% and for Operator B it is 14/125 or 11. say that the standard deviation is 6 cm or 0. Quartiles are useful to indicate where data such as student’s grades. In summary then there are the following five boundary limits: Q0 Q1 Q2 Q3 Q4 The quartile values can be determined by using in Excel [function QUARTILE].Chapter 2: Characterizing and defining data coefficient of variation can be either expressed as a proportion or a percentage of the mean.50 0.17%. in a manufacturing operation two operators are working on each of two machines.20%.5 m.25/150 (keeping all units in cm). However.0017 or 0. There are then three quartiles within these outer limits. Operator B completes on average 125 units/day with a standard deviation of 14 units. However if we compare the coefficient of variations. Which operator is the most consistent in the activity? If we just examine the standard deviation. The 1st quartile is Q1. say that a machine is cutting steel rods used in automobile manufacturing where the average length is 1. and thus might be considered more erratic. a person’s weight.27 Coefficient of variation. . The coefficient of variation is also a useful measure to compare two sets of data. denoted as Q0. and the upper limit is the maximum value Q4. the variability for Operator B is less than for Operator A because the mean output for Operator B is more. On this comparative basis. Operator A produces an average of 45 units/day. We then have the boundary limits of the quartiles which are those values that divide the dataset into four equal parts such that within each of these boundaries there is 25% of the data. The term σ/μ is strictly for the population distribution. Quartiles In the earlier section on Central Tendency of Data we introduced the median or the value that divides ordered data into two equal parts. In this case the coefficient of variation is 0. For example. This value is probably unacceptable for precision engineering in automobile manufacturing. sample values of s/x-bar will give you an estimation of the coefficient of variation. or four equal quarters.0025 m. Between these two values is contained 100% of the dataset. With this division of data the positioning of information within the quartiles is also a measure of dispersion.06/1. Boundary limits of quartiles The lower limit of the quartiles is the minimum value of the dataset. it appears that Operator B has more variability or dispersion than Operator A. and the 3rd quartile Q3. with a standard deviation of the number of pieces produced of 8 units.78 11. Mean output μ 45 125 Standard Coefficient of deviation variation σ σ/μ (%) 8 14 17.

875 137.832 121.975 134.298 115. One half of the inter-quartile range.695 82.198 110.569 184.598 68.456 143.895 97.958 89.498 107.489 163.490 138. or (Q3 Q1)/2.895 81.897 73.856 110.583 150.986 60.378 109.987 58.564 93.128 122.895 63.259 64.698 45.982 50.856 83.985 132. The smaller the quartile deviation.458 112.651 98.589 136.598 133.654 100.987 72.695 105.498 88.295 95.864 160.999 90.489 170.456 141.456 96.128 77. distortion from extreme values is limited as the quartile values are taken from an ordered set of data.589 152.659 125.695 89.597 99.895 100.987 60.754 101.598 128.597 120. the greater is the concentration of the middle half of the observations in the dataset.489 87.964 56.562 37.459 96.259 65.985 151.980 120.598 66. which is the difference between the 3rd and the 1st quartile in a dataset or (Q3 Q1).598 78.985 103.985 104.847 124.654 97.879 126.489 106.296 71.987 165.785 74. Also indicated is the inter-quartile range.654 118.58 Statistics for Business Table 2.189 101.479 116.957 Q3 Q1 (Q3 Q1)/2 (Q3 Q1)/2 Mean Mid-spread 43. these additional quartile properties only use two values in their calculation.976 100.698 81. Although like the range.584 97.689 114.897 68.865 152.859 121.958 102. .975 76.653 99.987 59.856 64.486 131.865 164.976 100. (Q3 Q1)/2.856 144.598 85.975 82.215 107.295 83.28.592 94.968 Mid-hinge 101.985 92.486 85.598 80. It measures the range of the middle 50% of the data.689 101.598 95.935 Quartile deviation 21.479 73.128 86.157 98.859 146.987 77.432 140.128 135. This information is shown in Table 2.189 124.786 132.865 103.564 113.958 113.785 108.654 124.945 84.958 54.958 106. or mid-spread.487 81.654 69. 35.896 109.298 88.984 90.667 Properties of quartiles For the sales data of Chapter 1.895 102.985 127.985 47.598 153.856 66.694 106.695 101.298 92.985 86.825 165.759 58.562 84.654 149.28 Quartiles for sales revenues.796 123.598 129. which gives the five quartile boundary limits plus additional properties related to the quartiles.958 71.895 145.985 96.326 99.298 111.985 80.895 102.289 130.562 156.492 103.958 63.894 75.943 102.312 119.854 76.598 77.895 52.568 103.985 122.958 168.592 108.590 89.897 70.985 123.598 55.265 142.654 62.654 70.589 105. we have developed the quartile values using the quartile function in Excel.378 79.597 85.957 91.298 61. The mid-hinge.957 117.578 161.459 136.569 139.894 93.984 74.489 69.987 87.987 86.598 47.258 162.587 104.459 78. is a measure of central tendency and is analogous to the midrange.897 134.459 138.911 184.698 78. is the quartile deviation and this measures the average range of one half of the data.296 123.562 89.890 72.632 123.298 112.859 67.569 104.876 87.489 82.874 Quartile Q0 Q1 Q2 Q3 Q4 Position 0 1 2 3 4 Value 35.498 56.592 111.965 184.598 120.564 79.584 133.985 91.598 134.

are shown as two horizontal lines. The 25% of the values that lie to the left of the box and the 25% of the values to the right of the box. Also. The box and whisker plot is right-skewed if the distance from Q2 to Q4 is greater than the distance from Q0 to Q2 and the distance from Q3 to Q4 is greater than the distance from Q0 to Q1. Q2.296 Q3 123. and the vertical line of the right-hand side of the box is the 3rd quartile. and the distance from Q2 to Q4 are the same.000 120. Q0.000 60.000 Sales ($) Box and whisker plot A useful visual presentation of the quartile values is a box and whisker plot (from the face of a cat – if you use your imagination!) or sometimes referred to as a box plot.000 180. Q1 79.Chapter 2: Characterizing and defining data 59 Figure 2. the mean value is greater than the median. Q4. The larger the width of the box relative to the two whiskers indicates that the data is clustered around the middle 50% of the values. or the other 50% of the dataset. or whiskers.000 140. The box and whisker plot for the sales data is shown in Figure 2.1 Box and whisker plot for the sales revenues. The extreme left part of the first whisker is the minimum value.000 200.000 160. Here.976 Q2 100.957 0 20. The box and whisker plot is symmetrical if the distances from Q0 to the median. In addition. and the extreme right part of the second whisker is the maximum value.1.000 100. the distance from Q0 to Q1 equals the distance from Q3 to Q4 and the distance from Q1 to Q2 equals the distance from the Q2 to Q3 and further the mean and the median values are equal. This means that the data values to the right of the median are more dispersed than those to the . the middle half of the values of the dataset or the 50% of the values that lie in the interquartile range are shown as a box. The vertical line making the left-hand side of the box is the 1st quartile.000 80.911 Q0 35.378 Q4 184.000 40.

1 2 3 4 5 6 7 8 9 10 11 12 13 X Q0 Q1 Q1 Q2 Q2 Q1 Q1 Q3 Q3 Q2 Q3 Q3 Q4 Y 2 2 3 3 1 1 3 3 1 1 1 2 2 Drawing the box and whiskerplot with Excel If you do not have add-on functions with Microsoft Excel one way to draw the box and whisker plot is to develop a horizontal and vertical line graph. $90. Percentiles The percentiles divide data into 100 equal parts and thus give a more precise positioning of where information stands compared to the quartiles. The box and whisker plot in Figure 2. the box and whisker plot is left-skewed if the distance from Q2 to Q4 is less than the distance from Q0 to Q2 and the distance from Q3 to Q4 is less than the distance from Q0 to Q1. 2.000? From the box and whisker plot of Figure 2.29 Coordinates for a box and whisker plot. where would we position Region A which has sales of $60. An amount of $120. a box and whisker plot is another technique in exploratory data analysis (EDA) that covers methods to give an initial understanding of the characteristics of data being analysed. Also. The reason that there are 13 coordinates is that when Excel creates the graph it connects every coordinate with a horizontal or vertical straight line to arrive at the box plot including going over some coordinates more than once. Q2. the mean value is less than the median. For example. the whiskers and the centre part of the box have the arbitrary value of 2.1 an amount of $60. This means that the data values to the left of the median are more dispersed than those to the right. Q3.000 is within the 1st quartile and not a great performance. Q1. Region B which has sales of $90. Conversely. Table 2.000 is within the 2nd quartile or within the box or the middle 50% of sales. and 3.60 Statistics for Business left of the median.000 is within the 4th quartile and a superior sales performance. Determine the five quartile boundary values Q0.000. We now ask the question. As mentioned in Chapter 1. an amount of $150. and the upper part of the box has the arbitrary value of 3. For the 2nd column you enter the corresponding quartile value. As the box and whisker plot has only three horizontal lines the lower part of the box has the arbitrary y-value of 1. Point No. The procedure for drawing the box and whisker plot is as follows. Again the performance is not great. There is further discussion on the skewed properties of data in Chapter 5 in the paragraph entitled Asymmetrical Data. and Q4 using the Excel quartile function. Finally. Say once we have drawn the box and whisker plot. and Region D which has sales of $150. the sales data from which it is constructed is considered our reference or benchmark.000. Set the coordinates for the box and whisker plot in two columns using the format in Table 2.000. The x-axis is the quartile values and the y-axis has the arbitrary values 1. Region C which has sales of $120.1 is slightly right-skewed. paediatricians will measure .000 is within the 3rd quartile and within the box or the middle 50% of sales and is a better performance.29.

Using the same sales revenue information that we used for the quartiles.861 68.397 71. This information can be used as an indicator of the growth pattern of the child.877 130.519 160.2.865 86.000.620 78. Development of percentiles We can develop the percentiles using [function PERCENTILE] in Excel.277 117.426 152.296 100. etc.499 70.342 111.896 58.654 91.595 97. Using this data we have developed Figure 2.707 66.384 83.680 88. Another use of percentiles is in examination grading to determine in what percentile.894 52.694 Percentile Value (%) ($) 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 76.189 73.858 95.267 118.764 105. and Region D which has sales of .865 108. and 90% have a height greater than that of your child. When you call up this function you are asked to enter the dataset and the value of the kth percentile where k is to indicate the 1st.437 56.15 or 15%.584 150. Region C which has sales of $120.709 184.592 133. Say once again as we did for the quartiles.137 99.589 79.811 110.911 124.864 135.657 153.318 78.533 98. which shows the percentiles as a histogram.833 Percentile (%) 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 Value ($) 92.682 143.843 Percentile (%) 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 Value ($) 132.187 128.072 61.30 gives the percentiles for this information using the percentage to indicate the percentile.000.778 90.670 109.040 99.166 123.320 72.116 123.682 89.025 164.341 163. you do not have to sort the data – Excel does this for you. For example assume the paediatrician says that for your child’s height he is in the 10th percentile.000.556 89.Chapter 2: Characterizing and defining data 61 Table 2.589 77.199 64.098 122.206 85. When you enter the value of k it has to be a decimal representation or a percentage of 100.378 45.756 170.404 85.614 121.095 145.975 60.963 134.830 100.30 Percentile (%) 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 Percentiles for sales revenues.085 146.238 106.492 102.130 65.957 the height and weight of small children and indicate how the child compares with others in the same age range using a percentile measurement.337 84.976 81.987 103.197 81.407 102.954 120.811 140.075 96.720 115. where would we position Region A which has sales of $60.325 165.396 75.962 137.972 101. 2nd.407 106. For example a percentile of 15% is the 15th percentile or a percentile of 23% is the 23rd percentile.101 96.717 93. This means that only 10% of all children in the same age range have a height less than your child. Region B which has sales of $90.899 113. a student’s grade falls. For example the 15th percentile has to be written as 0.848 82.809 69. Table 2.985 104.116 47.675 55.086 127. we ask the question.632 112.394 74. 3rd percentile.660 125.958 103.204 63.962 138.101 136.160 87.723 87. As for the quartiles.839 Percentile Value (%) ($) 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 107. Value ($) 35.

296. by using the quartiles – four equal parts. Division of data We can divide up data by using the median – two equal parts. which means that 93% of the sales are greater than this region and 7% are less – a poor performance.000 is at roughly the 92% percentile which signifies that only 8% of the sales are greater than this region and 92% are less – a good performance.000 100. is also 100. For the raw sales data given in Table 1.000? From either Table 2.296 (indicated at the end of paragraph median of this chapter). At the $120. In this case the median value equals the 3rd quartile which also equals the 50th percentile.000 Sales revenues ($) 120.30 or Figure 2. By describing the data using percentiles rather than using quartiles we have been able to be more precise as to where the region sales data are positioned.000 40. is also 100.000 this is at about the 7% percentile.62 Statistics for Business $150.000 140. given in Table 2.000 160.000 this is roughly the 39% percentile. Q2.000 level this is about the 71% percentile.000 0 Percentile (%) 95 10 0 10 15 20 25 30 35 40 45 50 55 60 65 70 75 80 85 90 0 5 .000 20. which means that 61% of the sales are greater than this region and 39% are less – not a good performance. which means that 29% of the sales are greater than this region and 71% are less – a reasonable performance.2 then we can say that for $60.2 Percentiles of sales revenues.000 60. For $90.000 80. 200. Figure 2. Finally $150.000 180. given in Table 2.1 the median value is 100. or using the percentiles – 100 equal parts.296 and the value of the 50th percentile. the value of the 2nd quartile.30.28.

The median is not affected by extreme values but may be affected by the number of values. We also have Q4. The most meaningful measures of dispersion are the variance and the standard deviation. The variance has squared units. The range is an often used measure of dispersion but it is not a good property as it is affected by extreme values. which is the average of the highest and lowest value in the dataset and this is very much dependent on extreme values. Dispersion of data The dispersion is the way that data is spread out. We might use the weighted average when certain values are more important than others. both of which take into consideration every value in the dataset. Q2 – the second. The mode is a measure of central tendency and is that value that occurs most often. Q1 – the first. which is the ratio of the standard deviation to its mean value. which is the sum of the data divided by the number of data points. Mathematically the standard deviation is the square root of the variance and it is more commonly used than the variance since it has the same units of the dataset from which it is derived. For a given dataset. The mean value can be distorted by extreme value or outliers. The most common measure of central tendency is the mean or average value. quartiles. From the quartiles we can . Data that is highly dispersed is unreliable compared to data that is little dispersed. then we would use the geometric mean as the measure of central tendency.Chapter 2: Characterizing and defining data 63 This chapter has detailed the meaning and calculation of properties of statistical data. Although. Chapter Summary Central tendency of data Central tendency is the clustering of data around a central or a middle value. A simple way to compare the relative dispersion of datasets is to use the coefficient of variation. By developing quartiles we can position information within the quartile framework and this is an indicator of its importance in the dataset. as for example interest rates each year. We also have the median. and percentiles. we also refer to Q0. dispersion. or that value that divides data into two halves. which is the start value in the quartile framework and also the minimum value. it gives us an indicator of its reliability for analytical purposes. The value of the 2nd quartile. Q2. If we know the central tendency this gives us a benchmark to situate a dataset and use this central value to compare one dataset with another. which is the last value in the dataset. or the maximum value. There is the midrange. which we have classified by central tendency. and Q3 – the third. If we know how data is dispersed. Thus there are five quartile boundary limits. If data are changing over time. The mode can be used for qualitative responses such as the colour that is preferred. is also the median value as it divides the data into two halves. Quartiles The quartiles are those values that divide ordered data into four equal values. the standard deviation of the sample is always more than the standard deviation of the population since it uses the value of the sample size less one in its denominator whereas the population standard deviation uses in the denominator the number of data values. there are really just three quartiles.

The left-hand whisker represents the first 25% of the data. in quartiles. and the right-hand whisker represents the last 25%.64 Statistics for Business develop a box and whisker plot. The box and whisker plot is distorted to the right when the mean value is greater than the median and distorted to the left when the mean is less than the median. Analogous to the range. Percentiles Percentiles are those values that divide ordered data into 100 equal parts. analogous to the midrange we have the mid-hinge which is the average of the 3rd and 1st quartile. which is a visual display of the quartiles. Percentiles are useful in that by positioning where a value occurs in a percentile framework you can compare the importance of this value. The 50th percentile in a dataset is equal to the 2nd quartile both of which are equal to the median value. For example. Also. . The middle box represents the middle half. we have the inter-quartile range. in the medical profession an infant’s height can be positioned on a standard percentile framework for children’s height of the same age group which can then be an estimation of the height range of this child when he/she reaches adulthood. which is the difference between the 3rd and 1st quartile values. or 50% of the data.

Weight category Price ($/package) Number of packages Less than 1 kg 10.00 37.545 From 10 to 50 kg 6.00 90. what would be the correct average billing rate used to price a project? 2.50 950 Required 1. What would be the billing rate if the straight arithmetic average were used? 2.000 Computing services 35. junior engineers. This information together with the number of packages delivered last year is given in the table below.000 Assistants 22.00 19.00 82.00 32. and assistants on its projects. Delivery Situation A delivery company prices its services according to the weight of the packages in certain weight ranges. computing services. If next year it was estimated that 400. If the estimate for performing a future job were 110. If this data was used for quoting on future projects.000 packages would be delivered. The billing rate to the customer for these categories is given in the table below together with the hours used on a recent design project.500 Greater than 50 kg 5.00 9.500 From 5 to 10 kg 7. what would be the billing amount to the customer? 3. Category Billing rate ($/hour) Project hours Senior engineers 85.500 Required 1. What is the average price paid per package? 2.000 hours. what would be an estimate of revenues? .00 23.000 From 1 to 5 kg 8. Billing rate Situation An engineering firm uses senior engineers.Chapter 2: Characterizing and defining data 65 EXERCISE PROBLEMS 1.00 120.000 Junior engineers 45.

70 9. Investment Situation Antoine has $1.90 9.50 3.50 Year Option 2 Interest rate (%) 8.20 7. The interest rates for the following two options are in the tables below.950 2004 16. What is the average annual growth rate.50 7.20 3. Option 1 Year Interest rate (%) 6. He has been promised two options of investing his money if he leaves it invested over a period of 10 years with interest calculated annually.000 to invest.980 . What would be the value of his investment at the end of 10 years if he invested in Option 1? 4.70 3. Production Situation A custom-made small furniture company has produced the following units of furniture over the past 5 years.50 4. geometric mean for Option 2? 3.650 2002 15.10 7.90 3.20 4.66 Statistics for Business 3.50 8.70 1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10 Required 1. Which is the preferred investment? 6. What is the average annual growth rate.70 4.20 6.50 6. What would need to be the interest rate in the 10th year for Option 2 in order that the value of his asset at the end of 10 years for Option 2 is the same as for Option 1? 4.30 4.890 2003 15.00 7. geometric mean for Option 1? 2.50 9. What would be the value of his investment at the end of 10 years if he invested in Option 2? 5.250 2001 14. Year Production (units) 2000 13.

40 0.51 0.49 Stamp for postcard 0.99 15. 4.45 0.24 2.34 0.11 0.95 2.99 22. median.200 Big Mac 2.04 0. p.300 16.50 2.50 3.69 Renault Mégane 15.500 1999 3.895 20.33 Required 1.60 2.45 Compact disc 19.71 17. 5/6 January 2002.100 21. range.77 0.79 0.70 0.11 2.90 3.71 1. Students Situation A business school has recorded the following student enrolment over the last 5 years Year Students 1997 3.84 0.56 1.00 16. What is the average percentage growth in this period? 2.875 17.650 13.65 2.99 21.800 Required 1.51 0. and the estimated coefficient of variation using the sample values for all of the items indicated. If this rate of percentage increase is maintained.38 0. midrange.459 14.3 Milk (1 l) Austria Belgium Finland France Germany Greece Ireland Italy Luxembourg The Netherlands Portugal Spain 0.93 16.770 12.59 0.48 0.47 0. Determine the maximum.86 0. what would be the production level in 2008? 5.47 1. .00 2.10 2. minimum. sample standard deviation. 2.80 Can of Coke 0.83 1.51 0.72 0. What is the average percentage increase in this period? 2.60 0. Euro prices Situation The table below gives the prices in Euros for various items in the European Union.57 14.450 2000 3.18 0.700 17.54 0.35 0.450 16.54 2.600 2001 3.52 0. average. what would be the student population in 2005? 3 International Herald Tribune.44 0.275 1998 3.Chapter 2: Characterizing and defining data 67 Required 1.99 21.41 0.50 22.780 14. If this average growth rate is maintained.52 0.50 0. What observations might you draw from these characteristics? 6.37 0.95 21.98 17.54 0.700 15.

on a privatized rail line in the United Kingdom. From this sample information. what is an estimate of the coefficient of variation? 9. What is the average percentage price increase in this period? 2. what would be the price in 2003? 8. what is the average number of trains late? 2. what is the sample variance? 7. If this rate of price increase is maintained. Construction Situation A firm purchases certain components for its construction projects. 25 15 20 17 42 13 42 39 45 35 20 25 15 36 7 32 25 30 25 15 3 38 7 10 25 Required 1.50 1997 110.15 2004 8. what is the range? 5. Net worth Situation A small firm has shown the following changes in net worth over a 5-year period. From this information. From this information. From this information.68 Statistics for Business 7. From this information.45 1999 122. From this information. From this information.56 2000 125. what is the median value of the number of trains late? 3.25 2001 9. Year Growth (%) 2000 6. What can you say about the distribution of the data? .75 Required 1.80 1998 115. Trains Situation A sample of the number of late trains each week. what is the mode value of the number of trains late? How many times does this modal value occur? 4. The price of these components over the last 5 years has been as follows. What is the average change in net worth over this period? 9. was recorded over a period as follows.90 Required 1.25 2002 8. Year Price ($/unit) 1996 105. From this information. what is the sample standard deviation? 8. what is the midrange? 6.75 2003 7.

p. 31 August 2004. Greece. . Summer Olympics 2004 Situation The table below gives the final medal count for the Summer Olympics 2004 held in Athens.Chapter 2: Characterizing and defining data 69 10.4 Country Argentina Australia Austria Azerbaijan Bahamas Belarus Belgium Brazil Britain Bulgaria Cameroon Canada Chile China Columbia Croatia Cuba Czech Republic Denmark Dominican Republic Egypt Eritrea Estonia Ethiopia Finland France Georgia Germany Greece Hong Kong Hungary India Indonesia Iran Ireland Israel Italy Jamaica 4 Gold 2 17 2 1 1 2 1 4 9 2 1 3 2 32 0 1 9 1 2 1 1 0 0 2 0 11 2 14 6 0 8 0 1 2 1 1 10 2 Silver 0 16 4 0 0 6 0 3 9 1 0 6 0 17 0 2 7 3 0 0 1 0 1 3 2 9 2 16 6 1 6 1 1 2 0 0 11 1 Bronze Country 4 16 1 4 1 7 2 3 12 9 0 3 1 14 1 2 11 4 6 0 3 1 2 2 0 13 0 18 4 0 3 0 2 2 0 1 11 2 Japan Kazakhstan Kenya Latvia Lithuania Mexico Mongolia Morocco The Netherlands New Zealand Nigeria North Korea Norway Paraguay Poland Portugal Romania Russia Serbia-Montenegro Slovakia Slovenia South Africa South Korea Spain Sweden Switzerland Syria Taiwan Thailand Trinidad and Tobago Turkey Ukraine United Arab Emirates United States Uzbekistan Venezuela Zimbabwe Gold 16 1 1 0 1 0 0 2 4 3 0 0 5 0 3 0 8 27 0 2 0 1 9 3 4 1 0 2 3 0 3 9 1 35 2 0 1 Silver 9 4 4 4 2 3 0 1 9 2 0 4 0 1 2 2 5 27 2 2 1 3 12 11 1 1 0 2 1 0 3 5 0 39 1 0 1 Bronze 12 3 2 0 0 1 1 0 9 0 2 1 1 0 5 1 6 38 0 2 3 2 9 5 2 3 1 1 4 1 4 9 0 29 2 2 1 International Herald Tribune. 20.

If we added in printing. and one point for a bronze medal. For product costing purposes. If the number of gold medals won is the criterion for rating countries. .75 Packing 15.5 (This is the information presented in the Box Opener. which countries in order are in the first 10? 3. Which three countries have the highest percentage of gold medals out of all the gold medals awarded? 11. where the wages are $25.50 Trimming 13.70 Statistics for Business Required 1. Operation Wages ($/hour) Hours per 100 units Binding 14. If the total number of medals won is the criterion for rating countries. Big Mac Situation The table below gives the price a Big Mac Hamburger in various countries converted to the $US. What is the average medal count per country for those who competed in the Summer Olympics? 5. then what would be the new correct average wage rate for the operation? 12.00 hour and the production time is 45 minutes per 100 units. what is the correct average rate per hour for 100 units for this part of the printing operation? 2.70 1. which countries in order are in the first 10? 2. which countries in order are in the first 10? Indicate the weighted average for these 10 countries.) 5 See Note 1.00 1. Develop a histogram for the percentage of gold medals by country for those who won a gold medal.25 Required 1. 4.25 1. two points for a silver medal. Printing Situation A small printing firm has the following wage rates and production time in the final section of its printing operation. If there are three points for a gold medal.

17 2.06 2.63 2. and Denmark? What initial conclusions could you draw from this information? 6.58 1. Where in the quartile distribution do the prices of the Big Mac occur in Indonesia.48 2.41 1. Determine the following characteristics of this data: (a) Maximum (b) Minimum (c) Average value (d) Median (e) Range (f) Midrange (g) Mode and how many modal values are there? (h) Sample standard deviation (i) Coefficient of variation using the sample standard deviation 2.92 3.38 Country Mexico New Zealand Peru Philippines Poland Russia Singapore South Africa South Korea Sweden Switzerland Taiwan Thailand Turkey United States Venezuela Price ($US) 2.58 1. What are the boundary limits of the quartiles? 4.96 1. .50 2.34 1.60 1.47 1. 3.05 2. The purchases include all food and non-food items.48 2.39 3.54 2. Illustrate the price of a Big Mac on a horizontal bar chart sorted according to price.13 Required 1. What is the inter-quartile range? 5.58 3.10 2.53 2.27 2.53 2.Chapter 2: Characterizing and defining data 71 Country Argentina Australia Brazil Britain Canada Chile China Czech Republic Denmark Egypt Euro zone Hong Kong Hungary Indonesia Japan Malaysia Price ($US) 1.17 5.17 2.49 4.64 2. Hungary. Purchasing expenditures – Part II Situation The complete daily purchasing expenditures for a large resort hotel for the last 200 days in Euros are given in the table below.44 2. Draw a box and whisker plot for this data. Singapore.55 3.76 1.30 4. 13.

320 235.466 194.651 190.157 295.577 157.377 175.523 212.212 161.811 185.973 144.173 273.777 108.256 141.411 94. laundry which is a purchased service.886 187.211 262.696 102.377 126.171 134.626 141. 2nd quartile.577 224.613 195.377 106.741 119.412 282.621 173.124 97.221 254.496 159.162 215. 4.923 165.777 180.676 231.230 221.276 86.421 259.251 175.212 152.462 161.157 295.476 242. energy including water for the three swimming pools. and the y-axis the dollar value of the retail sales.862 132. Plot this information on a histogram with the x-axis being the percentile value.880 307.977 298.257 188.251 241.157 180.355 288.175 248.346 263.826 173.480 192.677 146.72 Statistics for Business and wine for the five restaurants in the complex.676 141.187 194.866 170.613 197.076 99.249 270.612 68.525 224.411 185. Construct a box and whisker plot.466 118.864 293.773 156.741 150.336 201.524 303.876 217.173 238.076 120.573 86.230 139.682 249.568 90.937 332.680 197.211 114.496 194.226 246.651 148.375 108.609 168.766 106.211 181.787 179.155 147.615 124.173 188.186 225.833 223.977 157.024 332.731 151.956 198. What can you say about the distribution of this data? 5.157 274.573 187.755 242.936 208.876 154.415 132.173 230.533 128.135 102.138 222. gasoline for the courtesy vehicles.146 122.377 130.840 206.856 147.957 183.276 144.137 137.875 179.415 242.880 148.373 167.531 171.173 223.141 260.624 188.101 152.775 179.262 210.424 251. Verify that the median value.978 253.372 177.475 217.124 185.336 203.724 113.215 238.377 155.430 244.340 224.536 166.415 127. Determine the boundary limits for the quartile values for this data.415 142.426 249.536 110.476 172.898 218.240 182.177 112.802 130.524 192.880 157.240 291.880 125.876 140.741 115.849 191.973 173.624 203.773 Required 1.256 81. gardening and landscaping services.211 175.476 241.075 237.777 125.124 128.977 139.651 182. 3.777 213. 63.571 163.213 134.382 228.077 257.466 116.577 269.860 246.540 182.622 187.336 159.283 177.157 187.373 220. and the 50th quartile are the same.230 156.015 205.324 161.860 274.998 163.266 195.973 165.651 161.409 136.276 233. Using the raw data determine the following data characteristics: (a) Maximum value (you may have done this in the exercise from the previous chapter) (b) Minimum value (you may have done this in the exercise from the previous chapter) (c) Range (d) Midrange (e) Average value (f) Median value (g) Mode and indicate the number of modal values (h) Sample variance (i) Standard deviation (assuming a sample) (j) Coefficient of variation on the basis of a sample 2.011 146.564 217.075 154. .076 275.076 124.275 153. Determine the percentile values for this data.285 297.124 204.746 219.155 137.777 106.860 190.215 168.141 198.

From this information determine the following properties of the data: (a) The sample size (b) Maximum value (c) Minimum value (d) Range (e) Midrange (f) Average value (g) Median value (h) Modal value and how many times does this value occur? (i) Standard deviation if the data were considered a sample (which it is) (j) Standard deviation if the data were considered a population (k) Coefficient of variation (l) The quartile values (m) The inter-quartile range (n) The mid-hinge 2. Determine the percentiles for this data and plot them on a histogram. What are your observations about the box plot? 4. The community is considering building a restaurant facility in the swimming pool area but before a final decision is made.019 630 692 609 798 823 650 776 712 651 952 729 825 791 830 878 507 769 780 871 732 539 565 926 843 795 794 778 763 773 743 759 968 658 869 821 940 903 993 761 764 919 861 580 620 796 560 709 826 790 847 763 779 682 610 669 852 825 751 1. it wants to have assurance that the receipts from the attendance at the swimming pool will help finance the construction and operation of the restaurant. Using the quartile values develop a box and whisker plot.004 915 744 724 811 895 621 709 743 808 810 728 792 883 680 880 748 806 619 Required 1.088 750 931 901 726 678 672 582 716 749 685 790 785 835 869 837 745 690 829 748 980 860 707 907 830 956 878 755 874 1. 869 678 835 845 791 870 848 699 930 669 822 609 755 1. In order to give some justification to its decision the community noted the attendance for one particular year and this information is given below. which is open to the public each year from May 17 until September 13. 3. Swimming pool – Part II Situation A local community has a heated swimming pool. .Chapter 2: Characterizing and defining data 73 14.

7 13. Czech Republic.5 7.5 Required 1.5 9.3 11.3 9. tri.5 14.5 10.9 6. bi.4 5.1 12.5 12.8 10.6 10.6 8.7 11.7 11.8 8.5 11. etc.1 11.7 10. Bonn.1 11. a grocery chain in the Greater London area of the United Kingdom. 8. Determine the percentile values for the data and plot these on a histogram.5 7. Case: Starting salaries Situation A United States manufacturing company in Chicago has several subsidiaries in the 27 countries of the European Union including Calabas. The profits from these 50 stores.7 10. It is planning to hire new engineers to work in these subsidiaries and needs to decide on the starting salary to offer these new hires.8 7.5 7.7 8. Buyout – Part II Situation Carrefour.7 12. 16.9 9.4 9.3 12. 3.1 9. The human resource department of the parent firm in Chicago. These new engineers will be hired from their country of origin to work in their home country. in £’000s.6 11.6 11. for one particular month.1 6. is considering purchasing the total 50 retail stores belonging to Hardway. Spain.3 12. who is not too .5 10.2 15.5 10. Using the raw data determine the following data characteristics: (a) Maximum value (this will have been done in the previous chapter) (b) Minimum value (this will have been done in the previous chapter) (c) Range (d) Midrange (e) Average value (f) Median value (g) Modal value and indicate the order of modality (single.8 11.6 10.1 12.7 9.) (h) Standard deviation assuming the data was a sample (i) Standard deviation taking the data correctly as the population 2.2 8. Determine the quartile values for the data and use these to develop a box and whisker plot. are as follows.3 10. Watford.3 10. United Kingdom.3 13. France.74 Statistics for Business 15. Germany and Louny.9 9.6 9.2 10.

076 26.802 34.114 33.388 34.796 27.838 33.310 34.772 27.018 31.828 32.388 28.586 29.152 29.302 35.378 27.564 34.576 29.908 33.242 32.886 23.012 31. All these starting salaries include all social benefits and mandatory employer charges which have to be paid for the employee.442 28.974 36. This database.774 25.456 30.902 31.902 26.180 32.892 27.914 29.466 39.006 34.620 29.456 29.956 37.078 33.312 28.614 30.644 37.248 34.990 36.836 32.274 29.132 29.804 33.884 27.110 40.580 24.588 24.786 33.928 27.202 38.142 30.020 32.924 41.842 37.754 34.454 30.870 33.500.632 27.224 29.972 25.752 32.774 35.634 34. Use the information from this current chapter.164 33.894 33. Required Assume that you work with the human resource department in Chicago.878 33.758 28.010 30.376 23. textiles.668 33.724 30.530 30.414 29.114 25.580 33.978 29.160 33.622 36.188 36.478 25.850 31.866 30.024 35.648 33.044 33.220 24.126 35.178 27.070 32.546 35.982 37.722 30.860 25.700 and Jitka Sikorova a starting salary of €28.650 29.594 36.052 36.154 34.784 32.242 27.662 24.870 37.728 33.724 31.650 36. is given in the table below.434 29.660 27.588 26.742 35.200 35. 34. At the present time.380 33.490 29.276 31.652 26.252 35.518 32.906 34.648 27.198 26.072 26.964 26.762 36. with values converted to Euros.994 29.012 26.860 36.280 35.842 24.348 33.692 33.990 (Continued) .250 29.592 30.224 36.528 32.528 34. has the option to purchase a database of annual starting salaries for engineers in the European Union from a consulting firm in Paris.334 34.898 30.174 34.566 32.196 40.324 29.410 31.776 34.394 29.136 33.858 30.038 27.404 31. chemicals.072 32.886 29.580 35.992 24.450 27.086 31.366 38. the Chicago firm is considering hiring Markus Schroeder.282 28.488 30.302 28.192 41.884 24.030 28.756 29.204 32. pharmaceutical.472 34.268 30.958 34.566 27.240 33.700.504 30.146 27.332 31.214 31.012 36.292 36.122 32. describe the characteristics of the four starting salaries that have been offered and give you comments.700 33.410 28.Chapter 2: Characterizing and defining data 75 familiar with the employment practices in Europe.884 31.828 37.858 28.380 34. It was compiled from European engineers working in the automobile.902 28.216 34.662 34.756 25.750 35.356 33.928 35.396 31.564 35.822 31.912 38.022 33.082 35.056 28.974 35.606 25.900.782 25.898 26.184 34.184 27.212 33.264 21.370 35. offering a starting salary of €36.370 37.478 30.310 30.052 24.052 25.616 25.824 34. Then in using your results.016 31.800 31.782 28.658 33.944 26.572 42.202 24.370 33.752 33.654 40.102 31. and oil refining sectors.490 31.522 25.032 29.328 29.400 33.616 28.098 31.060 38.342 29.750 34.640 35.294 36.044 31.610 37.704 33.538 31.926 36.312 31.104 39. Joan Smith a salary of €32. Xavier Perez offering a salary of €30.246 34. and also from Chapter 1 to present in detail the salary database prepared by the Paris consulting firm.852 29.488 34.712 29.054 33.368 36.454 30.400 29.296 31.668 31.168 28.834 28.678 35.208 35.726 35. food.936 32.374 29.568 36.004 37.412 37.556 35. aeronautic.902 25.718 29.050 28.840 39.706 31.062 31.468 24.948 27.586 30.342 30.068 34.422 28.638 32.612 43.932 26.144 36.766 31.514 35.624 30.144 31.630 31.200 28.980 28.788 39.282 38.124 28.032 28.

628 36.196 28.312 28.524 27.906 37.214 29.882 31.568 28.468 31.882 30.486 30.310 37.468 31.768 27.016 27.328 29.394 30.858 36.482 33.608 32.828 29.492 33.596 22.934 34.196 36.968 36.148 28.696 22.286 33.134 30.376 33.346 33.766 22.136 37.942 40.584 33.806 32.254 27.118 29.496 33.994 29.664 30.732 39.776 30.726 29.840 30.568 31.592 32.938 35.252 23.914 34.400 32.838 30.570 38.266 31.178 33.762 41.818 31.396 33.744 25.924 38.832 31.594 31.264 31.378 29.902 27.676 32.042 39.750 28.052 37.470 34.784 23.746 29.418 31.700 34.562 24.100 35.110 39.490 28.738 30.910 32.990 27.434 31.210 27.942 32.384 27.754 36.100 32.616 30.472 32.010 33.214 29.062 30.426 29.178 27.490 25.876 31.782 38.030 27.742 25.966 32.148 35.426 29.544 35.268 33.796 33.834 33.768 34.214 25.892 25.158 32.090 36.518 27.738 35.596 35.644 34.300 37.354 20.402 31.652 29.632 35.596 34.832 30.442 31.616 34.052 33.846 33.530 26.390 33.844 39.358 25.716 31.132 30.384 33.326 32.868 30.376 23.164 31.854 32.596 28.470 36.386 29.742 35.524 33.772 31.600 27.754 35.048 37.392 33.038 34.166 35.336 36.426 34.150 33.218 34.512 33.182 26.544 28.270 27.332 24.976 27.128 27.168 24.256 30.754 22.446 35.170 39.622 34.036 29.006 36.050 29.502 21.328 29.000 23.536 32.566 32.252 32.164 34.868 28.662 32.504 29.836 30.502 27.676 28.176 25.186 41.930 29.514 27.102 28.396 26.504 30.102 27.974 27.496 27.972 22.954 35.304 31.204 34.962 30.704 30.114 29.538 35.196 29.070 31.178 26.880 33.216 28.350 29.964 32.498 30.952 27.468 29.916 34.424 33.780 33.522 31.100 31.090 32.372 36.410 26.344 28.992 31.714 40.272 36.884 29.360 31.724 29.556 33.570 23.646 25.596 37.730 39.736 30.830 37.396 31.002 33.848 31.134 27.514 26.812 39.794 36.450 35.498 .254 32.426 36.040 27.572 36.628 32.352 31.812 29.456 27.080 36.390 36.976 27.336 23.088 32.970 32.654 31.938 34.398 31.234 31.644 33.692 35.76 Statistics for Business 30.626 34.784 35.964 32.822 39.218 30.698 31.074 35.080 32.134 36.972 31.638 32.854 26.432 35.704 33.744 29.522 33.072 27.666 31.722 31.374 35.894 36.838 32.076 34.334 28.764 35.126 33.210 29.704 33.346 38.386 35.642 27.484 30.072 35.572 26.692 34.378 31.522 29.862 34.824 25.338 28.412 35.558 26.288 33.284 26.916 34.236 41.736 29.682 32.450 29.500 36.620 38.880 27.660 23.370 38.214 28.474 28.110 35.336 33.530 38.588 31.610 30.222 32.960 26.770 28.334 32.482 31.264 30.574 34.546 33.570 27.826 27.856 28.934 29.604 35.626 29.064 33.148 32.508 26.662 31.426 32.294 35.204 35.378 28.422 31.000 36.320 29.082 27.120 31.474 37.878 28.620 28.120 38.868 37.112 36.672 28.854 34.944 34.520 35.950 29.352 34.292 23.490 38.186 33.296 27.274 34.554 31.142 30.424 30.452 30.918 30.060 29.698 27.690 34.932 32.332 33.742 27.302 39.774 24.814 29.186 32.434 35.220 28.368 26.470 33.968 41.202 38.620 29.146 31.934 31.070 33.588 41.300 28.758 32.044 36.046 29.046 32.956 31.380 36.976 27.654 33.842 36.372 33.300 29.674 30.618 32.518 35.088 31.348 33.642 31.966 30.800 30.756 30.254 32.022 29.562 33.824 33.730 33.406 35.216 36.820 31.184 35.518 34.624 31.842 29.314 31.982 36.632 30.520 32.276 27.860 29.982 34.462 33.134 32.004 31.648 27.484 35.698 35.292 29.314 31.180 28.104 32.292 27.082 26.508 26.552 34.088 31.116 31.684 26.756 31.428 30.074 30.636 32.738 30.012 31.146 32.868 35.146 31.190 32.932 31.544 29.574 31.890 29.810 27.442 32.618 37.310 26.928 28.062 31.710 37.376 26.176 31.260 32.086 24.398 30.250 29.624 34.216 28.254 31.622 31.238 34.852 36.000 33.774 38.900 37.812 34.978 37.300 30.

000 30.364 37.490 37.600 31.336 33.874 29.986 36.048 26.882 33.702 31.742 34.926 25.960 31.938 22.358 34.048 30.842 30.990 32.930 33.124 31.626 34.942 36.180 30.902 30.062 32.414 26.766 32.538 29.562 31.302 34.060 35.704 36.460 34.786 30.318 38.454 28.258 31.578 32.342 27.260 26.340 26.510 34.024 36.298 34.220 32.256 29.084 34.570 30.856 28.236 35.780 37.036 28.382 31.022 30.398 34.730 30.440 33.508 34.852 30.368 27.858 29.Chapter 2: Characterizing and defining data 77 31.220 29.202 30.450 33.048 37.134 33.248 36.674 26.686 39.826 36.854 31.830 29.246 34.748 33.590 27.582 31.802 30.730 34.082 32.400 29.422 35.306 25.306 35.088 35.986 30.994 32.040 34.156 35.792 32.382 30.470 39.316 30.404 26.308 25.772 28.776 28.438 33.704 30.084 32.108 34.664 31.918 33.390 24.652 31.902 35.662 37.534 34.462 31.920 38.802 29.788 39.258 33.438 28.310 36.426 25.530 32.432 32.936 36.550 29.852 32.220 31.350 22.292 28.564 29.490 35.992 27.790 32.382 34.316 30.916 31.650 30.368 34.860 35.548 35.170 23.750 32.052 33.672 36.798 35.070 31.068 26.692 28.336 28.856 29.188 35.998 30.386 26.958 26.732 .502 40.100 26.334 30.870 34.530 24.880 28.592 28.028 32.606 33.658 42.868 37.472 26.926 29.462 35.762 33.616 27.178 32.056 32.872 39.836 42.786 25.144 35.584 39.598 34.774 33.216 29.224 24.452 28.206 29.358 29.330 29.858 30.144 24.344 33.562 29.554 32.502 30.680 31.040 35.460 37.992 28.814 28.514 29.520 27.606 31.560 35.430 43.688 34.160 33.834 29.504 36.786 33.092 25.844 28.094 34.690 27.160 33.746 32.664 28.824 25.642 32.520 30.664 29.830 31.502 27.656 36.636 32.552 33.388 27.842 29.762 29.

This page intentionally left blank .

Basic probability and counting rules 3 The wheel of fortune For many. Drinks are cheap. gambling casinos are exciting establishments.1 million people. The carpet patterns are busy so that you look at where the action is rather than looking at the floor. In 2004 in the United States. or blackjack. Poker is a particular growth area and some 18% of Americans played poker in 2004. which require no intelligence to operate. Today the gambling industry is run by respectable corporations instead of by the Mob and it is confident of winning public acceptance. which was a 50% increase over 2003. When there is a “win” coins drop noisily into aluminium receiving tray and blinking lights indicate to the world the amount that has been won. smartly dressed. When you want to go to the toilet you have to pass by rows of slot machines and perhaps on the way you try your luck! Gambling used to be a by-word for racketeering. so having “a few” encourages you to take risk. Throughout the casinos there are no clocks or windows so you do not see the time passing. or maybe free. Now it has cleaned up its act and is more profitable than ever. that means excluding those . or more than one-quarter of all American adults visited a casino. some 54. The dealers and servers are beautiful people. on average 6 times each. the United States’ 445 commercial casinos. and the roulette wheel have an air of mystery about them. The one-armbandits are colourful machines with flashing lights. who say very little and give an aura of superiority. The gaming rooms for poker. Together.

had revenues in 2004 of nearly $29 billion. Further. Spain.partouche. and spas a significant part of the service industry. gambling casinos. not including any from gambling dependent Nevada and New Jersey. hotels. 24 September 2005. A survey of 201 elected officials and civic leaders. http://www. resorts. . France. The Economist. And.fr. and Tunisia.1. consulted 27 September 2005. This makes the whole combination.2 1 2 The gambling industry. Just about all casinos are associated with hotels and restaurants and many others include resort settings and spas.80 Statistics for Business owned by Indian tribes.74 billion or almost 10% more than in 2003. let us not forget the famed casino in Monte Carlo. This is where statistics plays a role. Europe is no different. Las Vegas immediately springs to mind. it paid state gaming taxes of $4. Switzerland. Morocco. found that 79% believed casinos had had a positive impact on their communities. The company Partouche owns and operates very successful casinos in Belgium.

odds. If we are using percentages. The corollary to this is that there is a probability or risk of being incorrect. risk in system reliability. If the probability is 0% then there is absolutely no chance that an outcome will occur. which is the entire group in which we are interested. The following are the specific topics to be covered. but you were born in Austria. You will then be able to apply these concepts to practical situations. if you live in the United States. as there are mathematical relationships and rules that govern choices available. The probability is 100% that someday you will die – though hopefully at an age way above the statistical average! Between the two extremes of 0 and 1 something might occur or might not occur. 3 • Permutations of objects: Rule No. 4 • Combinations of objects: Rule No. Since we are extending our sample results beyond the data that we have measured. the probability of you becoming president is 0% – in 2006. 1 • Different types of events: Rule No. The meteorological office may announce that there is a 30% chance of rain Basic Probability Rules A principal objective of statistics is inferential statistics. Probability The concept of probability is the chance that something happens or will not happen. In statistical analysis the outcome of certain situations can be reliably estimated. We sample the opinion of 5. For example. 2 • Arrangement of different objects: Rule No. which is to infer or make logical decisions about situations or populations simply by taking and measuring the data from a sample. . ✔ ✔ ✔ Basic probability rules • Probability • Risk • An event in probability • Subjective probability • Relative frequency probability • Classical probability • Addition rules in classical probability • Joint probability • Conditional probabilities under statistical dependence • Bayes’ Theorem • Venn diagram • Application of a Venn diagram and probability in services: Hospitality management • Application of probability rules in manufacturing: A bottling machine • Gambling. this means that there is no guarantee but only a probability of being correct or of making the right decision. System reliability and probability • Series or parallel arrangements • Series systems • Parallel or backup systems • Application of series and parallel systems: Assembly operation.Chapter 3: Basic probability and counting rules 81 Learning objectives After you have studied this chapter you will understand basic probability rules. the current governor of California! At the top end of the probability scale is 100%. We use the information from this sample to infer conclusions about the population. and probability. This sample is taken from a population.500 of the electorate and we use this result to estimate the opinion of the population of 35 million. which means that it is certain the outcome will occur. then the scale is from 0% to 100%. In statistics it is denoted by the capital letter P and is measured on an inclusive numerical scale of 0 to 1. This is useful in decisionmaking since we can use these relationships to make probability estimates of certain outcomes and at the same time reduce risk. we are interested to know how people will vote in a certain election. 5. Under present law. Counting rules • A single type of event: Rule No. and counting rules.

000 and costs are £7. is risk. since Risk An extension of probability. the probability of an automobile driver aged between 18 and 25 years having an accident is considered greater than for people in higher age groups. However. For example. An event in probability In probability we talk about an event. Very often. For example if revenues are £10. If you obtain heads on the tossing of a coin. “Jim winning the slalom” would be an event. then “obtaining an A grade” would be an event. if there are 2. For example. Salesperson B may give only a 50% probability level of making that sale. There are no numbers involved. as he knows the client well. a single 22-year-old student what is the probability of him getting married next year? His response is 0%. and this particular situation has never occurred before. (Michael has never been married. You ask his friend John.000 or 0. However. If you buy one ticket in a fund raising raffle then you will either win or lose. one that has not been “fixed”. Salesperson A says that he is 80% certain of making a sale with a certain client. that is the situation is binomial. In this case the “value” on the outcome is more than monetary. These are qualitative responses. or perhaps the risk of killing yourself. often encountered is business situations. However that does not mean that there is a 50/50 chance of being right or wrong or a 50/50 chance of winning. you ask Michael. or afraid to take risks. If Susan wins a lottery.000 or a 99. In business we might invest in new technology and say that there is a 70% probability of increasing market share but this also might mean that there is a risk of losing $100 million. If Jim wins a slalom ski competition. With probability something happens or it does not happen. If you draw the King of Hearts from a pack of cards. If you obtain an A grade on an examination. Here. A manager who knows his employees well may be able to give a subjective probability of his department succeeding in a particular project. An event is the result of an activity or experiment that . To insurance companies. If you toss a fair-sided coin. you have a 50% chance of obtaining heads or 50% chance of throwing tails. This probability might differ from that of an outsider assessing the probability of success. “Susan winning the lottery” would be an event.000 (£10. what he thinks is the probability of Michael getting married next year and his response is 50%. is higher than those persons who are risk averse. Thus. In this case you risk having an accident. sometimes emotional.05% chance of winning and a 1. and simply based on the belief or the “gut” feeling of the person making the judgment.000 then it is sure that the gross profit is £3.000 tickets that have been sold you have only a 1/2. If you drink and drive the probability of you having an accident is high.82 Statistics for Business today. to the insurance company young people present a high risk and so their premiums are higher than normal. which also means that there is a 70% chance that it will not. Both are basing their arguments on subjective probability. which is qualitative. but also in our personal life.999/2.000).) Subjective probability may be a function of a person’s experience with a situation. then “drawing the King of Hearts” would be an event. or risk takers. Subjective probability One type of probability is subjective probability. the subjective probability of people who are prepared to take risks. or there are only two possible outcomes.95% chance of losing! has been carried out. when we extend probability to risk we are putting a value on the outcomes. then “obtaining heads” would be an event. The opposite of probability is deterministic where the outcome is certain on the assumption that the input data is reliable.000 £7. If you select a light bulb from a production lot and it is defective then the “selection of a defective light bulb” would be an event.

which is composed of the individual cards according to Table 3. We know in advance that the probability of drawing an Ace of Spades. is 1/52 or 1. Thus. 2. is classical probability. we know in advance that the probability of throwing a 5 or any other number is .33%. and childcare.000 married couples under study. data taken from a certain country indicate that in a sample of 3. when we developed a relative frequency histogram for the sales data given in Figure 1.000 to £105. Relative frequency probability A probability based on information or data collected from situations that have occurred previously is relative frequency probability.1 Suit Hearts Clubs Spades Diamonds Total Composition of a pack of cards with no jokers. 4. Classical probability is also known as simple probability or marginal probability and is defined by the following ratio: Classical probability Number of outcomes where the event occurs 3(i) Total number of possible outcomes s In order for this expression to be valid. Relative frequency probabilities have use in many business situations. one-third were divorced within 10 years of marriage.Chapter 3: Basic probability and counting rules 83 Table 3. Here. then from this Figure 1. In collecting data for determining relative frequency probabilities. or 6. the probability of the outcomes. the probability of being divorced before 10 years of marriage is 1/3 or 33. For example. the numbers 1.2. Similarly in the throwing of one die there are six possible outcomes. the data collected is sometimes referred to as historical data as the information after it has been collected is history.000.92%. if we assume that future conditions are similar to past events. We have already seen this in Chapter 1. Total Ace Ace Ace Ace 4 1 1 1 1 4 2 2 2 2 4 3 3 3 3 4 4 4 4 4 4 5 5 5 5 4 6 6 6 6 4 7 7 7 7 4 8 8 8 8 4 9 9 9 9 4 10 10 10 10 4 Jack Jack Jack Jack 4 Queen Queen Queen Queen 4 King King King King 4 13 13 13 13 52 the former are more optimistic or gung ho individuals. the reliability is higher if the conditions from which the data has been collected are stable and a large amount of data has been measured. or in fact any one single card. Also. 5. This demographic information can then be extended to estimate needs of such things as legal services. Classical probability A probability measure that is also the basis for gambling or betting. Again. the number of cards in the pack. and thus useful if you frequent casinos. as defined by the numerator (upper part of the ratio) must be equally likely. new homes. The total number of possible outcomes is 52. we can say that in this country. let us consider a full pack of 52 playing cards. 3.2 we could say that there is a 15% probability that future sales will lie in the range of £95. For example. on the assumption that future conditions will be similar to past conditions. Relative frequency probability is also called empirical probability as it is based on previous experimental work.1.

such as the tossing of coin. note its face value. A mutually exclusive event means that there is no connection between one event and another. is by equation 3(ii).84 Statistics for Business 1/6 or 16.85% Or we can look at it another way: P(Ace) 4 52 7. or a Spade. In the tossing of a coin there are only two possible outcomes. In many chance situations. with replacement after the first draw. AS. or the Queen of Hearts. Thus from equation 3(iii) the probability of drawing an Ace or a Spade is. QH. the Ace of Spades. and then put it back into the pack P(AS or QS ) 1 52 1 52 1 26 3. Thus an Ace and a Spade are not mutually exclusive events. Further. or to reduce the probability of drawing an Ace. My Canadian cousins had three girls and they really wanted a boy. If we do not replace the first card that is withdrawn. and this first card is neither the Ace of Spades. They tried again thinking after three girls there must be a higher probability of getting a boy.0192 Addition rules in classical probability In probability situations we might have a mutually exclusive event. P( AS or QS ) 1 52 1 51 0.69% P(Spade) 13 52 25. Thus the probability of obtaining heads or tails is 1⁄ 2 or 50%. or B) P(A) P(B) 3(ii) That is. These illustrations of classical probability are also referred to as a priori probability since we know the probability of an event in advance without the need to perform any experiments or trials. this means that it is possible for both events to occur. equation 3(ii) for mutually exclusive events must be adjusted to avoid double accounting.67%. but not both. P(Ace or Spade) 4 52 13 4 13 − * 52 52 52 16 52 30. if you obtain heads on one toss of a coin this event will have no impact of the following event when you toss the coin again.77% 17 1 − 52 52 For example. that is.00% . heads or tails. in a pack of cards. When two events are mutually exclusive then the probability of A or B occurring can be expressed by the following addition rule for mutually exclusive events P(A. If we consider for example. This time they had twins – two girls! The fact that they had three girls previously had no bearing on the gender of the baby on the 4th trial. a slightly higher probability than in the case with replacement.88% 0.0196 3. In this case. If two events are non-mutually exclusive. or B) P(A) P(B) P(AB) 3(iii) Here P(AB) is the probability of A and B happening together. obtaining heads on the tossing of a coin is mutually exclusive from obtaining tails since you can have either heads. each time you make the experiment. For example. They exhibit statistical independence. equation 3(ii) is adjusted to become the following addition rule for nonmutually exclusive events P(A. or tails. or the Queen of Hearts then the probability is given by the expression. Replacement means that we draw a card. the probability of drawing the Ace of Spades. everything resets itself back to zero. Thus. by the chance we could draw both of them together. the probability of drawing either an Ace or a Spade from a deck of cards. then the event Ace and Spade can occur together since it is possible that the Ace of Spades could be drawn.

78% 16.78% 2. 1st die 2nd die Total throw 1 6 7 2 5 7 3 4 7 4 3 7 5 2 7 6 1 7 Joint probability The probability of two or more independent events occurring together or in succession is joint probability.2 Possible combinations for obtaining 7 on the throw of two dice. a 4 and 3.704 0. the combination in the 2nd column. Thus. a 5 and 2. Consider for example again in gambling where we are using one pack of cards. the probability of obtaining a 7 on the throw of two dice is 6/36 16. P(A) is the marginal probability of A occurring and P(B) is the marginal probability of B occurring. the joint probability for throwing a 3 and 4 together. is from joint probability 1 1 * 6 6 1 36 2. the probability that all 6 can occur is determined as follows from the addition rule 2. two dice are thrown together.92% 1.78%. P(Ace or a Spade) 7.2.92% of drawing the Ace of Spades once in a single drawing.037% The chance of throwing a 2 and a 5 together.3. the number of possible outcomes where the number 7 occurs is six.78% 2.77% to avoid double accounting.78% 2. Number of outcomes where the event occurs Total number of possible outcomes t Here.Chapter 3: Basic probability and counting rules 85 P(Ace of Spades) 1 52 1. Here the value of 0. the .92%. This is calculated by the product of the individual marginal probabilities P(AB) P(A) * P(B) 3(iv) combination in the 1st column.67% This is the same result using the criteria of classical or marginal probability of equation 3(i).78% 2.78% Similarly.69 25. and a 6 and 1 together is always 2. Assume in another gambling game.037% for drawing the Ace of Spades twice in two draws is much less than the marginal productivity of 1. The classical or marginal probability of drawing the Ace of Spades from a pack is 1/52 or 1. In order for the total count to be 7. The total number of possible outcomes are 36 by the joint probability of 6 * 6.78% 2. the various combinations that must come up together on the dice are as given in Table 3. The probability of drawing the Ace of Spades both times in two successive draws with replacement is as follows: 1 1 * 52 52 1 2.92 Table 3.67% In order to obtain the number 5.78% Here P(AB) is the joint probability of events A and B occurring together or in succession.00 30. is from joint probability 1 1 * 6 6 1 36 2. From classical probability we know that the chance of throwing a 1 and a 6 together. Thus. the combinations that must come up together are according to Table 3. and the total number obtained is counted. The joint probability is always less than the marginal probability since we are determining the probability of more than one event occurring together in our experiment.

5. we know in advance that the probability of obtaining a 5 is 4/36 or 11.67%.10% as shown in the Figure 3. most gamblers lose! Conditional probabilities under statistical dependence The concept of statistical dependence implies that the probability of a certain event is dependent on the occurrence of another event.1. There are four different formats.10. Consider the lot of 10 cubes given in Figure 3. 0. Figure 3. In gambling with slot machines or a onearm-bandit.1667 * 0.1667 * 0.10 0.78% Probability 0.10% 11.3 Possible combinations for obtaining 5 on the throw of two dice. This shows that we have one cube that is dark green and dotted. two cubes are light green and striped. In this case the joint probability is 0.0046 0.0010 0.78% 2.4 according to the configuration of each cube. often the winning situation is obtaining three identical objects on the pull of a lever according to Figure 3. Probability of obtaining the same three fruits on a one-arm-bandit where there are 10 different fruits on each of the three wheels.3.1. Assume that we select a cube at random from the lot. three cubes are dark green and striped. there are 10 possible events and the probability of selecting any one cube at random from the lot is 10%. Random means that each cube has an equally chance of being chosen. The probability of winning is joint probability and is given by.10 * 0. The possible outcomes are shown in Table 3. Thus again this is a priori probability since in the throwing of two dice. These formats are also shown in Figure 3. One cube is dark green and dotted.10 * 0. and two cubes that are light green and striped.2. and four cubes are light green and dotted. If there are six different objects on each wheel.1 Joint probability.001 0.10 * 0.1667 0. P(A1 A2 A3) P(A1) * P(A2) * P(A3) 3(v) This low value explains why in the long run. then the marginal probability of obtaining one object is 1/6 16. three cubes that are dark green and striped. four cubes that are light green and dotted. Alternatively.12% (actually 11.11% if we round at the end of the calculation) Again from marginal probabilities this is 4/36 11.86 Statistics for Business Table 3. where we show three apples.78% 2. but each wheel has the same objects. 2.11%. As there are 10 cubes. P (ABC) P (A ) * P (B ) * P (C ) 1st die 2nd die Total throw 1 4 5 2 3 5 3 2 5 4 1 5 The probability that all four can occur is then from the addition rule.10 0.10 * 0. .11% (see also the following section counting rules).46% If there are 10 objects on each wheel then the marginal probability for each wheel is 1/10 0. Then the joint probability of obtaining all three objects together is thus.78% 2. this information can be presented in a two by two cross-classification or contingency table as in Table 3.

.4 Possible outcomes of selecting a coloured cube. The probability of the cube being striped is 5/10 or 50%. Dark green Dotted Striped Total 1 3 4 Light green 4 2 6 Total 5 5 10 ● Probability of occurrence with the total ● ● The probability of the cube being light green is 6/10 or 60%. Event 1 2 3 4 5 6 7 8 9 10 Probability (%) 10 10 10 10 10 10 10 10 10 10 Colour Dark green Dark green Dark green Dark green Light green Light green Light green Light green Light green Light green Design Dotted Striped Striped Striped Striped Striped Dotted Dotted Dotted Dotted Figure 3.2 Probabilities under statistical dependence.3 Probabilities under statistical dependence.5 Cross-classification table for coloured cubes.Chapter 3: Basic probability and counting rules 87 Figure 3. The probability of the cube being dark green is 4/10 or 40%. 10 cubes of the following format Table 3. Light green and dotted Dark green and striped Light green and striped Dark green and dotted 40% 10% 30% 20% 100% Table 3.

33% .67%. Using the relationship from equation 3(vi) and referring to Table 3. given light green) 1 3 13. If we select a striped cube from the lot what is the probability of it being light green? The condition is that we have selected a striped cube.88 Statistics for Business The probability of the cube being dotted is 5/10 or 50%. is equal to the joint probability of B and A happening together. In addition. Based on last year’s performance you believe that there is a high probability they have a chance of moving to the top of the league this year. four are dotted.00% Now. ● P(light green. as the current season moves on Newcastle loses many of the games even on their home turf. what is the probability of it being dotted? The condition is that we have selected a light green cube. P(striped. or in succession. divided by the marginal probability of A. certain probabilities may be revised to give posterior probabilities (post meaning afterwards). on the condition that A has occurred. given light green) P(striped and light green) P(light green) 2/10 6/10 P(dotted. two of their P(dotted and light green) P(light green) 4/10 6/10 2 3 33. Bayes’ Theorem The relationship given in equation 3(vi) for conditional probability under statistical dependence is attributed to the Englishman. P(striped. P(striped. given light green) P(dotted. if we select a light green cube from the lot. given light green) 4 2 1. This conditional probability under statistical dependence can be written by the relationship. It illustrates that if you have additional information. given dark green) P(dotted.00 6 6 The relationship.33% . P(B | A) P(BA) P(A) 3(vi) The relationship. The probability of the cube being dark green and dotted is 1/10 or 10% The probability of the cube being light green and dotted is 4/10 or 40%. There are six light green cubes and of these. The probability of the cube being light green and striped is 2/10 or 20%.00% ● ● ● P(dark green. thus the probability is 2/5 or 40%. However.5. There are five striped cubes and of these two are light green. Consider that you are a supporter of Newcastle United Football team. given dark green) 3 1 1. The Reverend Thomas Bayes (1702–1761) and is also referred to as Bayesian decision-making. The probability of the cube being dark green and striped is 3/10 or 30%.00 4 4 This is interpreted as saying that the probability of B occurring. given striped) ● P(light green and striped) P(striped) 2/10 5/10 2 5 40. given dotted) P(dark green and dotted) P(dotted) 1/10 5/10 1 5 20. and so the probability is 4/6 or 66. or based on the fact that something has occurred.

purchase the house you visited. and health spa management. In these cases. their areas will not overlap as shown in Figure 3. the sum of occupied areas is 2 boxes or 2/52 3. another way of looking at Bayes’ posterior rule is applying it to the often-used phrase of Hamlet. Each of the cards. is a useful 3 Based on a real situation for the Author in the 1980s. or 100%. Each card occupies 1 box and when we are considering two cards. Thus. In one particular year there are 80 students enrolled in the programme. If two events. your son starts smoking heavily. your son’s life expectancy drops as the risk of contracting life-threatening diseases such as lung cancer increases. Or. With this new information. or new elections in Algeria make the political situation in the country unstable with a security risk for the population. as time moves on. based on this posterior information. Here again the number of boxes is 52. A and B. and a particular outcome of an event is represented by part of this surface. However.77%. His life expectancy is in the high 70s.5. procrastination may be the best approach and. This programme covers hotel management.Chapter 3: Basic probability and counting rules best players have to withdraw because of injuries. named after John Venn an English mathematician (1834–1923). if we wait until new information comes along – Company A’s financial accounts turns out are inflated. based on these new events. Here the number of boxes is 52. tourism. the house you thought about buying turns out is on the path of the construction of a new auto route. Thus. . or take the high-paying job you were offered in Algiers. Venn diagram A Venn diagram.3 However. are not mutually exclusive their areas would overlap as shown in Figure 3. 28 in hotel management. “he who hesitates is lost”. 89 Application of a Venn diagram and probability in services: Hospitality management A business school has in its curriculum a hospitality management programme. Thus. However.85%. are mutually exclusive. The phrase implies that we should quickly make a decision based on the information we have at hand – buy stock in Company A. which is the entire sample space. the food industry.4. Algeria. the probability of Newcastle United moving to the top of the league has to be revised downwards. casino operation. This is a visual representation for a pack of cards using a rectangle for the surface. the probabilities are again revised downwards. way to visually demonstrate the concept of mutually exclusive and non-mutually exclusive events. which is entire sample space. The programme includes a specialization in hotel management and tourist management and for these programmes the students spend an additional year of enrolment. “he who hesitates comes out ahead”. This information is representative of the general profile of the hospitality management programme. Assume that your 18-year-old son is considered for a life insurance. Of these 80 students. Take into account another situation where insurance companies have actuary tables for the life expectancy of individuals. one card is common to both events and so the sum of occupied areas is 17 1 boxes or 16/52 30. and 5 specializing in both tourist and hotel management. if Bayes’ rule is correctly used it implies that it maybe unnecessary to collect vast amounts of data over time in order to make the best decisions based on probabilities. 13 Spades and 4 Aces would normally occupy 1 box or a total of 17 boxes. A surface area such as a circle or rectangle represents an entire sample space. If two events. 15 elect to specialize in tourist management.

4 Venn diagram: mutually exclusive events.85% 100% Figure 3.90 Statistics for Business Figure 3. one card is common to both events Sum of occupied areas 17 1 boxes or 16/52 30.77% . Ace H Ace S 2 3 4 5 6 7 8 9 10 J Q K Ace D Ace C Number of boxes 52 which is entire sample space Each card would normally occupy 1 box 17 boxes However. 1st Card 2nd Card Number of boxes 52 which is entire sample space Each card occupies 1 box Sum of occupied areas 2 boxes or 2/52 3.5 Venn diagram: non-mutually exclusive events.

50% This can also be expressed by the counting rule equation 3(iii): P(H or T) P(H) P(T) P(HT) 28 15 5 47.6. The two circles overlap indicating the 5 students who are specializing in both hotel and tourist management. There are (10 5) in tourist management in the circle on the right. What is the probability that a random selected student is in tourist management? From the Venn diagram this is the total in tourist management divided by total sample space of 80 students or.25% 6. 2. and from the Venn diagram this is 5/28 17. which leaves (80 23 5 10) 42 students as indicated not specializing.6 Venn diagram for a hospitality management programme. P(T) 5 10 80 18. What is the probability that a random selected student is in hotel or tourist management? From the Venn diagram this is.00% 5. P(T|H) P(TH) P(H) 5/80 28/80 5 28 17. What is the probability that a random selected student is in hotel management? From the Venn diagram this is the total in hotel management divided by total sample space of 80 students or.50% 5 80 80 80 3.86%. this is also written as.Chapter 3: Basic probability and counting rules 91 Figure 3. There are (23 5) in hotel management shown in the circle (actually an ellipse) on the left. P(H or T) 23 5 10 80 47. The rectangle is the total sample space of 80 students. Given a student is specializing in hotel management. From equation 3(vi). Hotel management 23 5 Tourist management 10 42 1.75% 4.86% . what is the probability that a random selected student is also specializing in tourist management? This is expressed as P(T|H). Illustrate this situation on a Venn diagram The Venn diagram is shown in Figure 3. P(H) 23 5 80 35. What is the probability that a random selected student is in hotel and tourist management? From the Venn diagram this is P(both H and T) 5/80 6.

Given a student is specializing in tourist management.01 0.000 * 2% 200 ● There are 98% of bottles filled correctly or 10. What are joint events for this situation? There are four joint events: ● An overfilled bottle and correctly capped ● An overfilled bottle and incorrectly capped ● A normally filled bottle and correctly capped ● A normally filled bottle and incorrectly capped. From equation 3(vi).33%. If the analysis were made looking at a sample of 10.0098 0.98% ● By the addition rule. Even if a bottle is filled correctly. then still 1% of the bottles are not properly capped.702 ● Thus. 3.852 ● Thus. P(H|T) P(HT) P(T) 5/80 15/80 5 15 33. ● Joint probability of a bottle being overfilled and faulty capped is 0.5% ● Joint probability of a bottle filled normally and faulty capped is (1 0. this is also written as. two major problems that occur are overfilling and caps not fitting correctly on the bottle top.6.000 bottles ● There are 2% bottles overfilled or 10. Further past data shows that if a bottle is overfilled then 25% of the bottles are faulty capped as the pressure differential between the bottle and the capping machine is too low. Application of probability rules in manufacturing: A bottling machine On an automatic combined beer bottling and capping machine.0050 0.48%. 25% are faulty capped or 200 * 25% 50 ● Thus bottles overfilled but correctly capped is 200 50 150 ● Bottles filled correctly but 1% are faulty capped or 9. What is the percentage of bottles that will be faulty capped and thus have to be rejected before final packing? Gambling. 4. What are the four simple events in this situation? The four simple events are: ● An overfilled bottle ● A normally filled bottle ● An incorrectly capped bottle ● A correctly capped bottle. and probability Up to this point in the chapter you might argue that much of the previous analysis is related to gambling and then you might say.02) * 0. 2. A bottle overfilled and faulty capped. This is developed as follows. From past data it is known that 2% of the bottles are overfilled. 1.25 0.33% Here there are two conditions where a bottle is rejected before packing.800 98 9. A bottle normally filled but faulty capped.92 Statistics for Business 7. how would this information appear in a cross-classification table? The cross-classification table is shown in Table 3.000 bottles. a bottle is faulty capped if it is overfilled and faulty capped or normally filled and faulty capped 0.852 148 1.800 * 1% 98 ● Thus filled correctly and correctly capped is 9.702 150 9. all bottles incorrectly capped is 10. odds. “but the business . bottles correctly capped is 9.0098 0.02 * 0.48% of the time.800 ● Of the bottles overfilled.000 * 98% 9.0148 1. what is the probability that a random selected student is also specializing in hotel management? This is expressed as P(H|T).0050 0. and from the Venn diagram this is 5/15 33.000 9. ● Sample size is 10.

The upper scheme shows a purely series arrangement and the lower a parallel arrangement.7. This is a general structure. as is indicated by the Box Opener. A system includes all the interacting components or activities needed for arriving at an end result or product. which contains n components in the case of a product. or n activities for processes. This service-related activity represents a non-negligible part of our economy! In risk. Although the odds depend on probability.702 150 9. or individual. This can be expressed mathematically as. Although the odds are related to probability they are a way of looking at risk. The value n can take on any integer value. shown in the upper diagram of Figure 3. Component 3. gambling. service. it means that for a system to operate we have to pass in sequence through Component 1. process. in order to produce the required output. The odds of winning are the ratio of the chances of losing to the chances of winning. Component 2. In the supply chain of a firm for example. We are confronted daily with gambling through government organized lotteries.800 200 10. In the system the reliability is the confidence that we have in a product.000 Total System Reliability and Probability Probability concepts as we have just discussed can be used to evaluate system reliability.Chapter 3: Basic probability and counting rules 93 Table 3. Volume Number that fit Right amount Overfilled Total 9. world is not just gambling”.852 Capping Number that does not fit 98 50 148 9. Thus the odds of obtaining the number 7 are 5 to 1. Generally. and gambling casinos. gambling. or whether the packing machines operate without breaking down. or stopping. reliability might be applied to whether the trucks delivering raw materials arrive on time. Thus the probability of not obtaining the number 7 is 30 out of 36 throws or 5 out of 6. the more components or activities in a product or a process. 5/6 1/6 5 1 Series or parallel arrangement A product or a process might be organized in a series arrangement or parallel arrangement as illustrated schematically in Figure 3. it is the odds that matter when you are placing a bet or taking a risk! .6 Cross-classification table for bottling machine. work team. whether the suppliers produce quality components. The odds of drawing the Ace of Spades from a full pack of cards are 51 to 1. such that we can operate under prescribed conditions without failure. or betting we refer to the odds of wining. and eventually to Component n. then the more complex is the system and in this case the greater is the risk of failure. or 1 out of 6. Earlier we illustrated that the probability of obtaining the number 7 in the tossing of two dice was 6 out of 36 throws. Alternatively a system may be a combination of both series and parallel arrangements. Series systems In the series arrangement. whether the operators turn up for work. buying and selling stock.7. The probability is the number of favourable outcomes divided by the total number of possible outcomes. Our capitalistic society is based on risk. That is true but do not put gambling aside. or unreliability. and as a corollary.

Assume that component R1 has a reliability of 99%. then the system fails. to a resistor (Component 3). In the electric heater example. when an electric heater is operating the electrical current comes from the main power supply (Component 1). For the electric heater. according to the following relationship: RS R1 * R2 * R3 * R4 * … Rn 3(vii) Here R1. the electric cable. R3. System connected in series X Component 1 Component 2 Component 3 Component n Y System connected in parallel (backup) Component 1 X Component 2 Y Component 3 Component n For example. RS. or in the system they are interdependent.98 * 0. Binomial means there are only two possible outcomes such as yes or no. R2. from which heat is generated. if the power supply fails. R2 a reliability of 98%. or the value of R.7 with three components. the complete electric heating system does depend on all the components functioning. through a cable (Component 2). Consider the system between point X and Y in the series scheme of Figure 3. the main power supply. If one component fails. The system reliability is then: RS R1 * R2 * R3 0.94 Statistics for Business Figure 3.99 * 0. However. n. will be 100% (nothing is perfect) and may have a value of say 99%. is the joint probability of the number of interacting components. or the resistor is broken then the heater will not function. represent the reliability of the individual components expressed as a fraction or percentage. etc.97 94. or the cable is cut. The reliability.7 Reliability: Series and parallel systems. or it will fail 1% of the time (100 99). The relationship in equation 3(vii) assumes that each component is independent of the other and that the reliability of one does not depend on the reliability of the other.11% . and R3 a reliability of 97%. This is a binomial relationship since the component either works or it does not. This means that a component will perform as specified 99% of the time. true of false. and the resistor are all independent of each other.9411 0. The reliability of a series system.

decreases rapidly with the number of components. for various values of the individual component reliability from 100% to 95%. in a series arrangement of multiple components. then as shown in Table 3. RS.6. RS 1 (1 R)n 3(xii) Consider the three component system in the lower scheme of Figure 3. where n is the number of components RS Rn 3(viii) RS Probability of R1 working Probability of R2 working * Probability of needing R2 The probability of needing R2 is when R1 is not working or (1 R1). Reorganizing equation 3(ix) RS RS RS RS R1 1 1 1 R2 R1 (1 (1 R2 * R1 R2 R1 R2 * R1 R2 1 R2 * R1) R2) 3(x) R1)(1 If there are n components in a parallel arrangement then the system reliability becomes RS 1 (1 (1 R1)(1 R2)(1 R4) … (1 Rn) R3) 3(xi) Parallel or backup systems The parallel arrangement is illustrated in the lower diagram of Figure 3.76% for 200 components.7 the system reliability drops from 94. the reliability of the system. However.35 50 36.26 200 1. ….76 In a situation where the components have the same reliability then the system reliability is given by the following general equation. The equation can be interpreted as saying that the more the number of backup units. or eventually Component n.71 25 60. R2.8 gives a family of curves showing the system reliability. When the backup components of quantity. then the greater is the system reliability.Chapter 3: Basic probability and counting rules 95 Table 3.39 10 81.12 5 90. where R1. Rn represent the reliability of the individual components. Component 2. Further. Assume that we have two components in a parallel system. Component 3. Number of components System reliability (%) 1 98. RS R1 R2(1 R1) 3(ix) Note that as already mentioned for joint probability. Figure 3. R1 the main component and R2 the backup or auxiliary component. is then given by the relationship. have an equal reliability.7 between point X and Y with the principal component R1 having a . n. The reliability of a parallel system. then the system reliability is given by the relationship. Thus.42 100 13. the system reliability RS is always less than the reliability of the individual components. Further.12% for three components to 1. this increase of reliability comes at an increased cost since we are adding backup which may not be used for any length of time. This illustrates that in order for equipment to operate we can pass through Component 1. to give a more complete picture. These curves illustrate the rapid decline in the system reliability as the number of components increases.00 3 94. For example assume that we have a system where the average reliability of each component is 98%.7 System reliability for a series arrangement.

R2 the first backup component having a reliability of 98%.0 70.5 99.0 10. this is a reliability greater than the reliability of the individual components.96 Statistics for Business Figure 3. n.5 96.0 Individual component reliability (same for each component) (%) reliability of 99%. Hospitals have back up energy systems in case of failure of the principal power supply. RS RS RS 1 1 1 (1 (1 R1)(1 0.9998 99.8 System reliability in series according to number of components.99)(1 R2) 0.0 95. Of course.02 0.0 100. Here the curves give the reliability with no backups (n 1) to three backup components (n 4). Since the components are in parallel they are called backup units. a system reliability greater than with using a single generator.9.999994 0. The more the number of backup units. the greater is the cost.0 96.994% RS RS 1 1 0.0 n 3 n n n 10 5 1 80.98)(1 0.98) Again. however.0 30. R2 then the system reliability is.99)(1 R2)(1 R3) 0.0 20.0 0. we would always want close to 100% reliability.5 97.0 n 200 n 100 n 50 n 25 99.01 * 0.97) 99.000006 That is. with greater reliability.5 95.98% 0.0 98. and R3 the second backup component having a reliability of 97% (the same values as used in the series arrangement). 100.0 90.0 97. ideally. If we only had the first backup unit.5 98.0 50. Most banks and other firms have backup computer systems containing client data should one system .0002 0. The system reliability is then from equation 3(xi). RS RS 1 1 (1 (1 R1)(1 0.0 System reliability (%) 60.0 40. then the greater is the system reliability as illustrated in Figure 3.

respectively.4 Aeroplanes have backup units in their design such that in the eventual failure of one component or subsystem there is recourse to a backup.0 100. We were told of four possible escape routes to get out of the path. Quentin Falavvier.0 30. 100.0 3 n 4 80. Without such a system. 18 November 2005. The following is an application example of mixed series and parallel systems. Application of series and parallel systems: Assembly operation In an assembly operation of a certain product there are four components A. Petersburg Florida when hurricane Charlie was about to land.0 n 90. 95%. 90%.0 n 60.0 fail.0 80.0 1 50.9 System reliability of a parallel or backup system according to number of components. and Portugal. my wife 4 and I were in a motor home in St.0 30.Chapter 3: Basic probability and counting rules 97 Figure 3.0 40. Near Lyon. although at a much reduced efficiency.0 System reliability (%) n 2 70. The emergency services had designated several backup exit routes – thankfully! When backup systems are in place this implies redundancy since the backup units are not normally operational.0 90. . B. The possible ways After a visit to the IKEA distribution platform in St. The IKEA distribution platform in Southeastern France has a backup computer in case its main computer malfunctions. and D that have an individual reliability of 98%. and 85%. In August 2004.0 50. C.0 Reliability of component (same for each) (%) 40. For example a Boeing 747 can fly on one engine. IKEA would be unable to organize delivery of its products to its retail stores in France. Spain. n.0 60. To a certain extent the human body has a backup system as it can function with only one lung though again at a reduced efficiency.0 70. France.

7122) 0.9945 99.10–3.12 Assembly operation: Arrangement No. A 98% B 95% C 90% D 85% Figure 3. in parallel with the reliability in the second row.68%. 2 Here this is a series arrangement in the top row in parallel with an assembly in the bottom row. 1 Here this is completely a series arrangement and the system reliability is given by the joint probability of the individual reliabilities: ● C 90% D 85% ● Reliability is 0.85 0.90 * 0.13 Assembly operation: Arrangement No.9800) 0. 2.98 Statistics for Business Figure 3. Probability of system failure is (1 0. joint probability of the individual reliabilities in the top row. 3. ● Arrangement No.55%.13.10 Assembly operation: Arrangement No.98 * 0. Probability of system failure is (1 0.85 0.22%. The system reliability is calculated by first the ● ● Reliability of top row is 0.45%.11 Assembly operation: Arrangement No.0055 0. Arrangement No. Determine the system reliability of the four arrangements. A 98% B 95% Figure 3.7268)*(1 0. Reliability of system is 1 (1 0. 4.78%.9945) 0. B 95% C 90% D 85% C 90% D 85% A 98% Figure 3.2878 28.7268 72.7122 71. .90 * 0. 1.95 * 0. A 98% B 95% of assembly the four components are given in Figures 3.95 * 0.

In this case the characteristic probability. In tossing a coin just 3 times it is impossible to say what will be the possible outcomes.62%. Probability of system failure is (1 0.10%. However. The following gives five different counting rules. A single type of event: Rule No. However. or results.000 times. 1 Heads Heads Heads 2 Heads Heads Tails 3 Heads Tails Heads 4 Tails Heads Heads 5 Tails Tails Tails 6 Tails Tails Heads 7 Tails Heads Tails 8 Heads Tails Tails Arrangement No. and the number of trials. n. k. or experiments. Joint reliability of bottom row is 0. of various types of experiments.90 * 0. The usefulness of counting rules is that they can give you a precise answer to many basic design or analytical situations. ● ● ● ● Joint reliability of top row is 0. 4 Here we have two units each in series and then the combination in parallel. the closer the result will be to the characteristic probability. the larger the number of trials.9500) * (1 0.85 0.8500) 0. 1 in the three tosses of the coin. when systems are connected in parallel. In summary.000015 0. is 3 and the number of events. The events.8 Outcome First toss Second toss Third toss Possible outcomes of the tossing of a coin 8 times.98 * 0. 3 Here we have four units in parallel and thus the system reliability is. Suppose for example that a coin is tossed 3 times. In Excel we use [function POWER] to calculate the result.9838) 0.8 gives the possible outcomes of the coin toss experiment. For example as shown for throw No.0015%. obtaining heads or tails are mutually exclusive since you can only have heads or tails in one throw of a coin. heads could be obtained each time.95 0.9800) * (1 0. Table 3. Then the number of trials. or experiments is n. Reliability of system is 1 (1 0. we can reasonably estimate that we will obtain approximately 500 heads and 500 tails. and then the third heads.9983 98. Counting Rules Counting rules are the mathematical relationships that describe the possible outcomes.999985) 0. then the total possible outcomes of single types events are given by kn.50%.Chapter 3: Basic probability and counting rules 99 Table 3. there is no probability involved.38%. Probability of system failure is (1 0. or 8. is 2 since only heads or tails are the two possible events. you perform the analysis.9310 93.0162 1. The collectively exhaustive outcome is 23. the reliability is the highest and the probability of system failure is the lowest. P(x) is 50% since there is Arrangement No. Alternatively as shown for throw No. or trials.900) * (1 0.7650 76. ● ● 1 (1 0.7650) 0. The counting rules are in a way a priori since you have the required information before . That is. 6 the first two tosses could be tails. 1 If the number of events is k.9310) * (1 0.9985%. if there are many tosses say a 1.999985 99.

1st die 2nd die Total of both dice an equal chance of obtaining either heads or tails. the letters A to Z. k3 possible events on the 3rd trial. the first letter can be any letter from A to Z. (This was the licence plate number of my first car. two dice are used. that I owned as a student in the 1960s the time of the Beatles – la belle époque!) In this format there are three numbers. 1st die 2nd die Total of both dice Throw No. and the third. This idea is further elaborated in the law of averages in Chapter 4. 4. Then the total possible different outcomes are 6 * 6 or 36. then the total possible outcomes of different events are calculated by the following relationship: k1 * k2 * k3 … kn 3(xiii) Suppose in gambling. 1 1 1 2 13 1 3 4 25 1 5 6 2 2 1 3 14 2 3 5 26 2 5 7 3 3 1 4 15 3 3 5 27 3 5 8 4 4 1 5 16 4 3 7 28 4 5 9 5 5 1 6 17 5 3 8 29 5 5 10 6 6 1 7 18 6 3 9 30 6 5 11 7 1 2 3 19 1 4 5 31 1 6 7 8 2 2 4 20 2 4 6 32 2 6 8 9 3 2 5 21 3 4 7 33 3 6 9 10 4 2 6 22 4 4 8 34 4 6 10 11 5 2 7 23 5 4 9 35 5 6 11 12 6 2 8 24 6 4 10 36 6 6 12 1st die 2nd die Total of both dice Throw No.9 gives the 36 possible combinations. or the number of licence plates is 17. Table 3. Note. Assume that the format for a licence plate is 212TPV. Similarly. the numbers from 0 to 9. 3. For letters. or 6.14. and the same for the third. 5. Possible outcomes of the tossing of two dice. Thus the outcome is n * P(x) or 1.100 Statistics for Business Table 3. k2 possible events on the 2nd trial. there are 10 possible outcomes. and kn possible events on the nth trial. Similarly. in England. 2. the possible events from throwing the second die are also six. the same for the second. The possible events from throwing the first die are six since we could obtain the number 1. an Austin A40. For numbers. Consider another example to determine the total different licence plate registrations that a country or community can possibly issue.9 Throw No. Different types of events: Rule No.67% (6/36).566.000 on the assumption that 0 is possible in the first place . Thus the total possible different combinations. there are 26 possible outcomes. followed by three letters.000 * 50% 500. 2 If there are k1 possible events on the 1st trial or experiment. the same for the second letter. The relative frequency histogram of all the possible outcomes is shown in Figure 3. Thus the first digit of the licence plate can be the number 0 to 9. This is the same value we found in the previous section on joint probabilities. that the number 7 has the highest possibility of occurring at 6 times or a probability of 16.

18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0 16.818.10 Possible arrangement of three different colours. but in the factorial relationship. where. In Excel we use [function FACT] to calculate the result. Note. Permutations of objects: Rule No.818.14 Frequency histogram of the outcomes of throwing two dice.11 8.000 9 * 10 * 10 * 26 * 26 * 26 15.33 8.10 gives these six possible arrangements. 3 In order to determine the number of ways that we can arrange n objects is n!. 1 2 3 4 5 6 Arrangement of different objects: Rule No. or n factorial. For example. then the number possible is 15.56 2. 4 A permutation is a combination of data arranged in a particular order.576.11 11.67 13.Chapter 3: Basic probability and counting rules 101 Figure 3. yellow.000 If zero is not permitted in the first place. The number . n! n(n 1)(n 2)(n 3) … 1 3(xiv) Red Red Yellow Yellow Blue Blue Yellow Blue Blue Red Red Yellow Blue Yellow Red Blue Yellow Red 3! 3*2*1 6 This is the factorial rule.89 Frequency of occurrence (%) 11.33 5. 0! 1.78 2 3 4 5 6 7 8 9 10 11 12 Total value of the sum of the two dice 10 * 10 * 10 * 26 * 26 * 26 17. Table 3.000 Table 3. red.78 2.56 5.89 13. the number of ways that the three colours. and blue can be arranged is. the last term in equation 3(xiv) is really (n n) or 0.

The number of ways. Jim. assume that there are four candidates for two positions in an operating committee: Dan. combinations. Jim. Again.12 gives the combinations. differs from Rule No. 4. nP x Table 3. 5. Dan is the president and Sue is the secretary. Table 3. whereas it is unimportant for combinations. For a given set of items the number of permutations will always be more than the number of combinations because with permutations the order of the data is important. Choice 1 2 3 4 5 6 n! (n x)! 3(xv) President Dan Dan Dan Sue Sue Jim Vice president Sue Jim Ann Jim Ann Ann Suppose there are four candidates Dan.102 Statistics for Business Table 3. Here the same two people can be together. of arranging x objects selected in order from a total of n objects is. . from n objects is given by. For example in the 1st choice. or permutations. and Ann. permutations. In Excel we can use [function COMBIN] to directly calculate the result. 5 A combination is a selection of distinct items regardless of order. The number of ways a president and secretary can be chosen now without the same two people working together. Note that the Rule No. and Ann. In the 6th choice their positions are reversed.12 Combinations for organizing an operating committee. of arranging x objects. regardless of position is by equation 3(xvi) 4C 2 4! 2!(4 2)! 6 Combinations of objects: Rule No. providing they have different positions. 4P 2 nC x n! x !(n x)! 3(xvi) 4! (4 − 2)! 12 In Excel we use [function PERMUT] to calculate the result. Sue. or combinations.11 Choice President Secretary Permutations in organizing an operating committee. regardless of order. 1 Dan Sue 2 Dan Jim 3 Dan Ann 4 Sue Jim 5 Sue Ann 6 Sue Dan 7 Jim Ann 8 Jim Dan 9 Jim Sue 10 Ann Dan 11 Ann Sue 12 Ann Jim of ways. by the value of x! in the denominator.11 gives the various permutations. Sue. who have volunteered to work on an operating committee: the number of ways a president and secretary can be chosen is by equation 3(xv). Table 3. Sue is the president and Dan is the secretary.

components are connected in parallel. we use joint probability. (We do not know in advance the number of slots on the roulette wheel – but the casino does!). Relative frequency probability is derived from collected data and is thus also called empirical probability. where we can put a monetary value on the outcome of a particular action. This gives a backup to the system. then the system fails. In probability we talk about an event. Gambling involving dice. which is the confidence that we have in the product or process operating under prescribed conditions without failure. and a particular outcome of an event is shown by part of this surface. Odds are related to probability but odds are the ratio of the chances of losing to the chances of winning. cards. . If one component fails. or in succession. Probability may be subjective and this is the “gut” feeling or emotional response of the individual making the judgment. Classical probability is also a priori probability because before any action occurs we know in advance all possible outcomes. To visually demonstrate relationships in classical probabilities we can use Venn diagrams where a surface area. such as a circle. or roulette wheels are examples of classical probability since before playing we know in advance that there are six faces on a die. even though small. the power system in a hospital. When the probability of failure. or a bank’s computerbased information system. Posterior probability is Bayes’ Theorem. represents an entire sample space. The last part of the chapter has dealt with mathematical counting rules. which can be modified to avoid double accounting. the cost is always higher for a parallel arrangement since we have a backup that (we hope) will hardly. the addition rule gives the chance that two or more events occur. In gambling. can be catastrophic such as for an airplane in flight. or never. we use joint probability. System reliability and probability A system is a combination of components in a product or many of the process activities that makes a business function. When one event has already occurred then this gives posterior probability meaning the new chance based on the condition that another event has already happened. However. If a system is made up of series components then we must rely on all these series components working. be used. The probability of failure of parallel systems is always less than the probability of failure for series systems for given individual component probabilities. An extension of probability is risk. or system failure. A third is classical or marginal probability. To determine the probability of two or more events occurring together. We often refer to the system reliability. we refer to the odds of something happening. or does not happen. Chapter Summary Basic probability rules Probability is the chance that something happens. particularly in horse racing. which is the outcome of an experiment that has been undertaken. To determine the system reliability. Within classical probability. which is the ratio of the number of desired outcomes to the total number of possible outcomes. 52 cards in a pack. on the downside.Chapter 3: Basic probability and counting rules 103 This chapter has introduced rules governing basic probability and then applied these to reliability of system design.

the possible arrangements is given by kn. k2. For given values of n and x the value using permutations is always higher than for combinations. exactly the number of combinations. the number of licence plate combinations that are possible when using a mix of numbers and letters. with given criteria. for example. n! for the number of different ways of organizing n objects. as we know in advance. arrangements. The first rule is that for a fixed number of possible events. The third rule uses the factorial relationship. k3 and k4 then the possible arrangements are k1 * k2 * k3 * k4. or outcomes that are possible. respectively. The second rule is if we have events of different types say k1. . k.104 Statistics for Business Counting rules Counting rules do not involve probabilities. n. Permutations gives the number of possible ways of organizing x objects from a sample of n when the order is important. This rule will indicate. they are a sort of a priori conditions. The fourth and fifth rules are permutations and combinations. However. then for an experiment with a sample of size. Combinations determine the number of ways of organizing x objects from a sample of n when the order is irrelevant.296. If we throw a single die 4 times then the possible arrangements are 64 or 1.

with replacement. Gardeners’ gloves Situation A landscape gardener employs several students to help him with his work. If three gloves are selected at random from the box. what is the probability that all three are left handed? 4. what is the probability that a pair of gloves will be selected? (One glove is right handed and one glove is left handed. what is the probability that both gloves selected will be right handed? 5. what is the probability that both gloves selected will be right handed? 2. what is the probability that a correct pair of gloves will be selected? 2.) 3. If two gloves are selected at random from the box. without replacement. Market Survey Situation A business publication in Europe does a survey or some of its readers and classifies the survey responses according to the person’s country of origin and their type of work. If two gloves are selected at random from the box. with replacement. One morning they come to work and take their gloves from a communal box. Country Denmark France Spain Italy Germany Consultancy 852 254 865 458 598 Engineering 232 365 751 759 768 Investment banking 541 842 695 654 258 Product marketing 452 865 358 587 698 Architecture 385 974 845 698 568 Required 1. Required 1. without replacement. If two gloves are selected at random from the box. with replacement. What is the probability that a survey response taken at random comes from a reader in Italy? . This information according to the number or respondents is given in the following contingency table. This box contains only five left-handed gloves and eight right-handed gloves. If two gloves are selected at random from the box.Chapter 3: Basic probability and counting rules 105 EXERCISE PROBLEMS 1.

106

Statistics for Business

2. What is the probability that a survey response taken at random comes from a reader in Italy and who is working in engineering? 3. What is the probability that a survey response taken at random comes from a reader who works in consultancy? 4. What is the probability that a survey response taken at random comes from a reader who works in consultancy and is from Germany? 5. What is the probability that a survey response taken at random from those who work in investment banking comes from a reader who lives in France? 6. What is the probability that a survey response taken at random from those who live in France is working in investment banking? 7. What is the probability that a survey response taken at random from those who live in France is working in engineering or architecture?

3. Getting to work

Situation

George is an engineer in a design company. When the weather is nice he walks to work and sometimes he cycles. In bad weather he takes the bus or he drives. Based on past habits there is a 10% probability that George walks, 30% he uses his bike, 20% he drives, and 40% of the time he takes the bus. If George walks, there is a 15% probability of being late to the office, if he cycles there is a 10% chance of being late, a 55% chance of being late if he drives, and a 20% chance of being late if he takes the bus. 1. 2. 3. 4. On any given day, what is the probability of George being late to work? Given that George is late 1 day, what is the probability that he drove? Given that George is on time for work 1 day, what is the probability that he walked? Given that George takes the bus 1 day, what is the probability that he will arrive on time? 5. Given that George walks to work 1 day, what is the probability that he will arrive on time?

4. Packing machines

Situation

Four packing machines used for putting automobile components in plastics packs operate independently of one another. The utilization of the four machines is given below.

Packing machine A 30.00%

Packing machine B 45.00%

Packing machine C 80.00%

Packing machine D 75.00%

Chapter 3: Basic probability and counting rules

107

Required

1. What is the probability at any instant that both packing machine A and B are not being used? 2. What is the probability at any instant that all machines will be idle? 3. What is the probability at any instant that all machines will be operating? 4. What is the probability at any instant of packing machine A and C being used, and packing machines B and D being idle?

5. Study Groups

Situation

In an MBA programme there are three study groups each of four people. One study group has three ladies and one man. One has two ladies and two men and the third has one lady and three men.

Required

1. One person is selected at random from each of the three groups in order to make a presentation in front of the class. What is the probability that this presentation group will be composed of one lady and two men?

6. Roulette

Situation

A hotel has in its complex a gambling casino. In the casino the roulette wheel has the following configuration.

9 8 7 6

1 2 3 4

5

5

4 3 2 1 9 8 7

6

108

Statistics for Business

There are two games that can be played: Game No. 1 Here a player bets on any single number. If this number turns up then the player gets back 7 times the bet. There is always only one ball in play on the roulette wheel. Game No. 2 Here a player bets on a simple chance such as the colours white or dark green, or an odd or even number. If this chance occurs then the player doubles his/her bet. If the number 5 turns up, then all players lose their bets. There is always only one ball in play on the roulette wheel.

Required

1. In Game No. 1 a player places £25 on number 3. What is the probability of the player receiving back £175? What is the probability that the player loses his/her bet? 2. In Game No. 1 a player places £25 on number 3 and £25 on number 4. What is the probability of the player winning? What is the probability that the player loses his/her bet? If the player wins how much money will he/she win? 3. In Game No. 1 if a player places £25 on each of several different numbers, then what is the maximum numbers on which he/she should bet in order to have a chance of winning? What is this probability of winning? In this case, if the player wins how much will he/she win? What is the probability that the player loses his entire bet? How much would be lost? 4. In Game No. 2 a player places £25 on the colour dark green. What is the probability of the player doubling the bet? What is the probability of the player losing his/her bet? 5. In Game No. 2 a player places £25 on obtaining the colour dark green and also £25 on obtaining the colour white. In this case what is the probability a player will win some money? What is the probability of the player losing both bets? 6. In Game No. 2 a player places £25 on an even number. What is the probability of the player doubling the bet? What is the probability of the player losing his/her bet? 7. In Game No. 2 a player places £25 on an odd number. What is the probability of the player doubling the bet? What is the probability of the player losing his/her bet?

7. Sourcing agents

Situation

A large international retailer has sourcing agents worldwide to search out suppliers of products according the best quality/price ratio for products that it sells in its stores in the United States. The retailer has a total of 131 sourcing agents internationally. Of these 51 specialize in textiles, 32 in footwear, and 17 in both textiles and footwear. The remainder are general sourcing agents with no particular specialization. All the sourcing agents are in a general database with a common E-mail address. When a purchasing manager from any of the retail stores needs information on its sourced products they send an E-mail to the general database address. Anyone of the 131 sourcing agents is able to respond to the E-mail.

Chapter 3: Basic probability and counting rules

109

1. Illustrate the category of the specialization of the sourcing agents on a Venn diagram. 2. What is the probability that at any time an E-mail is sent it will be received by a sourcing agent specializing in textiles? 3. What is the probability that at any time an E-mail is sent it will be received by a sourcing agent specializing in both textiles and footwear? 4. What is the probability that at any time an E-mail is sent it will be received by a sourcing agent with no specialty? 5. Given that the E-mail is received by a sourcing agent specializing in textiles what is the probability that the agent also has a specialty in footwear? 6. Given that the E-mail is received by a sourcing agent specializing in footwear what is the probability that the agent also has a specialty in textiles?

8. Subassemblies

Situation

A subassembly is made up of three components A, B, and C. A large batch of these units are supplied to the production site and the proportion of defective units in these is 5% of the component A, 10% of the component B, and 4% of the component C.

Required

1. What proportion of the finished subassemblies will contain no defective components? 2. What proportion of the finished subassemblies will contain exactly one defective component? 3. What proportion of the finished subassemblies will contain at least one defective component? 4. What proportion of the finished subassemblies will contain more than one defective component? 5. What proportion of the finished subassemblies will contain all three defective components?

9. Workshop

Situation

In a workshop there are the four operating posts with their average utilization as given in the following table. Each operating post is independent of the other.

Operating post Drilling Lathe Milling Grinding

Utilization (%) 50 40 70 80

110

Statistics for Business

Required

1. 2. 3. 4. What is the probability of both the drilling and lathe work post not being used at any time? What is the probability of all work posts being idle? What is the probability of all the work posts operating? What is the probability of the drilling and the lathe work post operating and the milling and grinding not operating?

10. Assembly

Situation

In an assembly operation of a certain product there are four components A, B, C, and D which have an individual reliability of 98%, 95%, 90%, and 85%, respectively. The possible ways of assembly the four components making certain adjustments, are as follows.

Method 1

A 99.00% B 96.00%

B 96.00% C 90.00%

C 90.00% D 92.00%

D 92.00%

Method 2 A 99.00% A 99.00% B 96.00% Method 3 C 90.00% D 92.00% A 99.00% Method 4 C 90.00% D 92.00% B 96.00%

Chapter 3: Basic probability and counting rules

111

Required

1. Determine the system reliability of each of the four possible ways of assembling the components. 2. Determine the probability of system failure for each of the four schemes.

**11. Bicycle gears
**

Situation

The speeds on a bicycle are determined by a combination of the number of sprocket wheels on the pedal sprocket and the rear wheel sprocket. The sprockets are toothed wheels over which the bicycle chain is engaged and the combination is operated by a derailleur system. To change gears you move a lever, or turn a control on the handlebars, which derails the chain onto another sprocket. A bicycle manufacturer assembles customer made bicycles according to the number of speeds desired by clients.

Required

1. Using the counting rules, complete the following table regarding the number of sprockets and the number of gears available on certain options of bicycles.

Bicycle model A B C D E F G H I J

Pedal sprocket 1 2 2 3 3 4 4

Rear wheel sprocket 1 2 4 5 7 7 9

Number of gears 2 6 10 12 28 32

**12. Film festival
**

Situation

The city of Cannes in France is planning its next film festival. The festival will last 5 days and there will be seven films shown each day. The festival committee has selected the 35 films which they plan to show.

Required

1. How many different ways can the festival committee organize the films on the first day?

112

Statistics for Business

2. If the order of showing is important, how many different ways can the committee organize the showing of their films on the first day? (Often the order of showing films is important as it can have an impact on the voting results.) 3. How many different ways can the festival committee organize the films on (a) the second, (b) the third, (c) the fourth, (d) and the fifth and last day? 4. With the conditions according the Question No. 3, and again the order of showing the films is important, how many different ways are possible on (a) the second, (b) the third, (c) the fourth, (d) and the fifth and last day?

**13. Flag flying
**

Situation

The Hilton Hotel Corporation has just built two large new hotels, one in London, England and the other in New York, United States. The hotel manager wants to fly appropriate flags in front of the hotel main entrance.

Required

1. If the hotel in London wants to fly the flag of every members of the European Union, how many possible ways can the hotel organize the flags? 2. If the hotel in London wants to fly the flag of 10 members of the European Union, how many possible ways can the flags be organized assuming that the hotel will consider all the flags of members of the European Union? 3. If the hotel in London wants to fly the flag of just five members of the European Union, how many possible ways can the flags be organized assuming that the hotel will consider all the flags of members of the European Union? 4. If the hotel in New York wants to fly the flag of all of the states of the United States how many possible ways can the flags be organized? 5. If the hotel in New York wants to fly the flag of all of the states of the United States in alphabetical order by state how many possible ways can the flags be organized?

**14. Model agency
**

Situation

A dress designer has 21 evening gowns, which he would like to present at a fashion show. However at the fashion show there are only 15 suitable models to present the dresses and the designer is told that the models can only present one dress, as time does not permit the presentation of more than 15 designs.

Required

1. How many different ways can the 21 different dress designs be presented by the 15 models?

Chapter 3: Basic probability and counting rules

113

2. Once the 15 different dress designs have been selected for the available models in how many different orders can the models parade these on the podium if they all walk together in a single file? 3. Assume there was time to present all the 21 dresses. Each time a presentation is made the 15 models come onto the podium in a single file. In this case how many permutations are possible in presenting the dresses?

15. Thalassothérapie

Situation

Thalassothérapie is a type of health spar that uses seawater as a base of the therapy treatment (thalassa from the Greek meaning sea). The thalassothérapie centres are located in coastal areas in Morocco, Tunisia, and France, and are always adjacent or physical attached to a hotel such that clients will typically stay say a week at the hotel and be cared for by (usually) female therapists at the thalassothérapie centre. A week stay at a hotel with breakfast and dinner, and the use of the health spar may cost some £6,000 for two people. A particular thalassothérapie centre offers the following eight choices of individual treatments.5 1. Bath and seawater massage (bain hydromassant). This is a treatment that lasts 20 minutes where the client lies in a bath of seawater at 37°C where mineral salts have been added. In the bath there are multiple water jets that play all along the back and legs, which help to relax the muscles and improve blood circulation. 2. Oscillating shower (douche oscillante). In this treatment the client lies face down while a fine warm seawater rain oscillates across the back and legs giving the client a relaxing and sedative water massage (duration 20 minutes). 3. Massage under a water spray (massage sous affusion). This treatment is an individual massage by a therapist over the whole body under a fine shower of seawater. Oils are used during the massage to give a tonic rejuvenation to the complete frame (duration 20 minutes). 4. Massage with a water jet (douche à jet). Here the client is sprayed with a high-pressure water jet at a distance by a therapist who directs the jet over the soles of the feet, the calf muscles, and the back. This treatment tones up the muscles, has an anti-cramp effect and increases the blood circulation (duration 10 minutes). 5. Envelopment in seaweed (enveloppement d’algues). In this treatment the naked body is first covered with a warm seaweed emulsion. The client is then completely wrapped from the neck down in a heavy heated mattress. This treatment causes the client to perspire eliminating toxins and recharges the body with iodine and other trace elements from the seaweed (duration 30 minutes).

5

Based on the Thalassothérapie centre (Thalazur), avenue du Parc, 33120 Arcachon, France July 2005.

114

Statistics for Business

6. Application of seawater mud (application de boue marine). This treatment is very similar to the envelopment in seaweed except that mud from the bottom of the sea is used instead of seaweed. Further, attention is made to apply the mud to the joints as this treatment serves to ease the pain from rheumatism and arthritis (duration 30 minutes). 7. Hydro-jet massage (hydrojet). In this treatment the client lies on their back on the bare plastic top of a water bed maintained at 37°C. High-pressure water jets within the bed pound the legs and back giving a dry tonic massage (duration 15 minutes). 8. Dry massage (massage à sec). This is a massage by a therapist where oils are rubbed slowly into the body toning up the muscles and circulation system (duration 30 minutes). In addition to the individual treatments, there are also the following four treatments that are available in groups or which can be used at any time: 1. Relaxation (relaxation). This is a group therapy where the participants have a gym session consisting of muscle stretching, breathing, and mental reflection (duration 30 minutes). 2. Gymnastic in a seawater swimming pool (Aquagym). This is a group therapy where the participants have a gym session of running, walking, and jumping in a swimming pool (duration 30 minutes). 3. Steam bath (hammam). The steam bath originated in North Africa and is where clients sits or lies in a marble covered room in which hot steam is pumped. This creates a humid atmosphere where the client perspires to clean the pores of the skin (maximum recommended duration, 15 minutes). 4. Sauna. The sauna originated in Finland and is a room of exotic wood panelling into which hot dry air is circulated. The temperature of a sauna can reach around 100°C and the dryness of the air can be tempered by pouring water over hot stones that add some humidity (maximum recommended duration, 10 minutes).

Required

1. Considering just the eight individual treatments, how many different ways can these be sequentially organized? 2. Considering just the four non-individual treatments, how many different ways can these be sequentially organized? 3. Considering all the 12 treatments, how many different ways can these be sequentially organized? 4. One of the programmes offered by the thalassothérapie centre is 6 days for five of the individual treatments alternating between the morning and afternoon. The morning session starts at 09:00 hours and finishes at 12:30 hours and the afternoon session starts at 14:00 hours and finishes at 17:00 hours. In this case, how many possible ways can a programme be put together without any treatment appearing twice on the same day? Show a possible weekly schedule.

Chapter 3: Basic probability and counting rules

115

**16. Case: Supply chain management class
**

Situation

A professor at a Business School in Europe teaches a popular programme in supply chain management. In one particular semester there are 80 participants signed up for the class. When the participants register they are asked to complete a questionnaire regarding their sex, age, country of origin, area of experience, marital status, and the number of children. This information helps the professor organize study groups, which are balanced in terms of the participant’s background. This information is contained in the table below. The professor teaches the whole group of 80 together and there is always 100% attendance. The professor likes to have an interactive class and he always asks questions during his class.

Required

When you have a database with this type of information, there are many ways to analyse the information depending on your needs. The following gives some suggestions, but there are several ways of interpretation. 1. What is the probability that if the professor chooses a participant at random then that person will be: (a) From Britain? (b) From Portugal? (c) From the United States? (d) Have experience in Finance? (e) Have experience in Marketing? (f) Be from Italy? (g) Have three children? (h) Be female? (i) Is greater than 30 years in age? (j) Are aged 25 years? (k) Be from Britain, have experience in engineering, and be single? (l) From Europe? (m) Be from the Americas? (n) Be single? 2. Given that a participant is from Britain then, what is the probability that that the person will: (a) Have experience in engineering? (b) Have experience in purchasing? 3. Given that a participant is interested in finance, then what is the probability that person is from an Asian country? 4. Given that a participant has experience in marketing, then what is the probability that person is from Denmark? 5. What is the average number of children per participant?

116 Statistics for Business Number 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 Sex M F F F F M M F M F M M F M F F F F M M F M M M M F M F M F F F F M M M M F F M F M F M M M Age 21 25 27 31 23 26 25 29 32 21 26 28 27 35 21 26 25 31 22 20 26 28 29 35 41 25 23 23 25 26 22 26 28 24 23 25 26 24 25 28 31 32 26 21 25 24 Country United States Mexico Denmark Spain France France Germany Canada Britain Britain Spain United States China Germany France Germany Britain China Britain Britain Germany Portugal Germany Luxembourg Germany Britain Britain Denmark Denmark Norway France Portugal Spain Germany Britain United States Canada Canada Denmark Norway France Britain Britain Luxembourg China Japan Experience Engineering Marketing Marketing Engineering Production Production Engineering Production Engineering Finance Engineering Finance Engineering Production Engineering Marketing Production Production Production Marketing Engineering Engineering Engineering Production Finance Marketing Engineering Production Marketing Finance Marketing Engineering Engineering Production Engineering Production Engineering Marketing Marketing Engineering Finance Engineering Finance Marketing Marketing Production Marital status Married Single Married Married Married Single Single Single Married Single Married Single Married Married Married Married Married Single Married Single Married Single Single Married Married Single Married Single Single Married Single Married Single Married Single Married Married Single Single Married Married Married Single Single Married Married Children 0 2 0 2 0 3 0 3 2 1 2 0 3 0 2 3 3 4 2 3 2 1 0 0 3 0 3 3 2 3 2 3 3 2 1 0 0 2 0 3 5 2 3 2 5 2 .

Chapter 3: Basic probability and counting rules 117 Number 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 Sex F F M F F F M F M F M M M M F M F M F F M M M F M F M F M F M F M M Age 25 26 24 21 31 35 38 39 23 25 26 23 25 26 24 25 28 31 32 25 25 25 26 24 25 26 28 25 26 31 40 25 26 23 Country France Britain Germany Taiwan China Britain United States China Portugal Indonesia Portugal Britain China Canada Mexico China France United States Britain Germany Spain Portugal Luxembourg Taiwan Luxembourg Britain United States France France Germany France Spain Portugal Taiwan Experience Marketing Marketing Production Engineering Engineering Marketing Marketing Engineering Purchasing Engineering Purchasing Marketing Purchasing Engineering Purchasing Engineering Production Marketing Marketing Engineering Purchasing Engineering Production Marketing Production Engineering Engineering Engineering Production Marketing Engineering Marketing Purchasing Production Marital status Single Married Single Married Single Married Married Single Married Married Married Single Married Single Married Single Married Single Married Single Married Single Single Single Married Married Single Married Single Single Married Single Married Single Children 0 3 2 1 3 0 5 2 3 2 2 0 3 0 3 0 1 2 3 0 2 1 3 0 1 2 3 0 0 0 3 2 1 1 .

This page intentionally left blank .

after dinner. How does the retailer manage this randomness? Further. and the supermarket is full of people buying groceries. Perhaps in the shopping mall there is a supermarket.Probability analysis for discrete data 4 The shopping mall How often do you go to the shopping mall – every day. you need a new coat. or on the weekends? Why do you go? It might be that you have nothing else better to do. or perhaps just once a month? When do you go? Perhaps after work. in the morning when you think you can beat the crowds. once a week. when these potential customers are at the mall they behave in a binomial fashion – either they buy or they do not buy. It is Saturday. All these variables of when and why people go to the mall represent a complex random pattern of potential customers. . it is a grey. you need a new pair of shoes. dreary day and it is always bright and cheerful in the mall. you are going to meet some friends. How to manage the waiting line or the queue at the cashier desk? This chapter covers some of these concepts. you fancy buying a couple of CDs. you want to see a film in the evening so you go to the mall a little early and just have a look around.

factors like the weather. The number of cars on a particular stretch of road on any given day is random and knowing the pattern would help us to decide on the appropriateness of installing stop signs.120 Statistics for Business Learning objectives After you have studied this chapter you will learn the application of discrete random variables. or whole numbers. Besides gambling. or traffic signals for example. within the range of the possible values of the data. 5 hotel rooms are vacant. These subjects are treated as follows: ✔ ✔ ✔ Distribution for discrete random variables • Characteristics of a random variable • Expected value of rolling two dice • Application of the random variable: Selling of wine • Covariance of random variables • Covariance and portfolio risk • Expected value and the law of averages Binomial distribution • Conditions for a binomial distribution to be valid • Mathematical expression of the binomial function • Application of the binomial distribution: Having children • Deviations from the binomial validity Poisson distribution • Mathematical expression for the Poisson distribution • Application of the Poisson distribution: Coffee shop • Poisson approximated by the binomial relationship • Application of the Poisson–binomial relationship: Fenwick’s Discrete data are statistical information composed of integer values. For example. the number of people arriving at a shopping mall in any particular day is random. For example as illustrated in the Box Opener “The shopping mall”. In our gambling situation. With discrete data there is a clear segregation and the data does not progress from one class to another. Characteristics of a random variable Random variables have a mean value and a standard deviation. 2934 bottles have been sold. we could say that 9 machines are shutdown. They originate from the counting process. there are many situations in the environment that occur randomly and often we need to understand the pattern of randomness in order to make appropriate decisions. The number of people seeking medical help at a hospital emergency centre is random and again understanding the pattern helps in scheduling medical staff and equipment. or 3 students are absent. or the hour of the day. It is information that is unconnected. 5 2 hotel rooms are empty. The mean value of random data is the weighted average of all the possible . 8 units are defective. It is true that in some cases of randomness. If we knew the pattern it would help to better plan staff needs. then they are considered discrete random variables. every value has an equal chance of occurring. and how to use the binomial and the Poisson distributions. 29 bottles have been sold. do influence the magnitude of the data but often even if we know these factors the data are still random. Distribution for Discrete Random Variables If the values of discrete data occur in no special order. the value obtained by throwing a single die is random and the drawing of a card from a full pack is random. This means that. discussed in Chapter 3. the day of the week. 812 units ⁄ ⁄ 1 are defective. It makes little sense to say 912 machines are ⁄ shutdown. or 314 ⁄ ⁄ students are absent. and there is no explanation of their configuration or distribution.

If we assume that this particular pattern of randomness might be repeated we call this mean also the expected value of the random variable.2 the total value of the possible throws is. or the number of possible throws to give this total value is 11. Application of the random variable: Selling of wine Assume that a distributor sells wine by the case and that each case generates €6. The last column of Table 4. Let us turn this situation around and ask the question. 8. σ ∑(x μx )2 P(x) 4(iii) The following demonstrates the application of analysing the random variable in the throwing of two dice. “What is the expected value obtained in the throwing two dice. and 12) by adding the numbers from the two dice.2 and the probability P(x) of obtaining these totals is in Column 3 of the same table. or the chance of obtaining that value x.8333 6. 9.3901. Sales data for the last 200 days is given in Table 4. from equation 4(iii) the standard deviation is. 7. 6. The standard deviation of a random variable is the square root of the variance or.Chapter 4: Probability analysis for discrete data outcomes of the random variable and is given by the expression: Mean value. Thus. Another way that we can determine the average value of the number obtained by throwing two dice is by using equation 2(i) for the mean value given in Chapter 2: ∑x 2(i) N From Column 1 of Table 4.2 gives the calculation for the variation of obtaining the Number 7 using equation 4(ii). The expected value of throwing two dice is 7 as shown in the last line of Column 5. x ∑x N 77 11 7 Expected value of rolling two dice In Chapter 3. If we consider that this data is representative of future sales. Variance. x ∑x 2 3 11 4 12 5 6 77 7 8 9 10 121 Here x is the value of the discrete random variable.1 gives the possible 36 combinations that can be obtained on the throw of two dice. or E(x). The sale of wine is considered random. A and B?” We can use equation 4(i) to answer this question. Using equation 4(i) we can calculate the expected or mean value of throwing two dice and the calculation and the individual results are in Columns 4 and 5.3. The number of possible ways that these 11 totals can be achieved is summarized The following is a business-related application of using the random variable. there are just 11 different possible total values (2. The variance of a distribution of a discrete random variable is given by the expression. then the frequency of occurrence of sales can be used to estimate the expected or . The total in the last line of Column 4 indicates the probability of obtaining these eleven values as 36/36 or 100%. except that instead of dividing by the number of data values. The value N. Standard deviation.67%.00 in profit. μx ∑x * P(x) E(x) 4(i) in Column 2 of Table 4. and P(x) is the probability. we used combined probabilities to determine that the chance of obtaining the Number 7 on the throw of two dice was 16. which gives a straight average. σ2 ∑(x μx )2 P(x) 4(ii) This is similar to the calculation of the variance of a population given in Chapter 2. Finally. As this table shows of the 36 combinations. Table 4. 11. 5. 4. 10. 3. 40. here we are multiplying by P(x) to give a weighted average.

5556 0.0556 0. Die A Die B Total Throw No.1 Throw No.1667 1. Die A Die B Total Throw No.0000 2.8333 0.5000 9.3333 E(x) 7.1667 0. Number of possible ways 1 2 3 4 5 6 5 4 3 2 1 36 Probability P(x) 1/36 2/36 3/36 4/36 5/36 6/36 5/36 4/36 3/36 2/36 1/36 36/36 x * P(x) Weighted value of x 0. 1 1 1 2 13 1 3 4 25 1 5 6 2 2 1 3 14 2 3 5 26 2 5 7 3 3 1 4 15 3 3 6 27 3 5 8 4 4 1 5 16 4 3 7 28 4 5 9 5 5 1 6 17 5 3 8 29 5 5 10 6 6 1 7 18 6 3 9 30 6 5 11 7 1 2 3 19 1 4 5 31 1 6 7 8 2 2 4 20 2 4 6 32 2 6 8 9 3 2 5 21 3 4 7 33 3 6 9 10 4 2 6 22 4 4 8 34 4 6 10 11 5 2 7 23 5 4 9 35 5 6 11 12 6 2 8 24 6 4 10 36 6 6 12 Table 4.0000 (x μ) (x μ)2 (x μ)2 P(x) 2 * (1/36) 3 * (2/36) 4 * (3/36) 5 * (4/36) 6 * (5/36) 7 * (6/36) 8 * (5/36) 9 * (4/36) 10 * (3/36) 11 * (2/36) 12 * (1/36) 5 4 3 2 –1 0 1 2 3 4 5 25 16 9 4 1 0 1 4 9 16 25 1.3889 2.0000 1.0000 7.0000 0.3333 0.122 Statistics for Business Table 4.1111 1.1111 4. Die A Die B Total Possible outcomes on the throw of two dice.2222 0.2 Value of throw (x) 2 3 4 5 6 7 8 9 10 11 12 Total Expected value of the outcome of the throwing of two dice.8333 1.8333 0.3333 40.6667 3.8333 .6111 0.7778 8.

an estimate of future profits is €6. σ 0.00 40.9937 4(iv) Table 4.9875 cases2 2 2 Figure 4. Here.00 .75)2 * 25% 0.00 * 11.75)2 * 40% 11.75) (12 11.75) * 15% (11 11. from equation 4(iv) Probability of selling 12 cases is 80/200 40.4. Using equation 4(ii) to calculate the variance.00 20.9875 0.00% The complete probability distribution is given in Table 4. and the histogram of this frequency distribution of the probability of sale is in Figure 4.4 Cases of wine sold over the last 200 days. 45 Frequency of this number sold (%) 40 35 30 25 20 15 10 5 0 10 11 12 Cases of wine sold 13 15.75 €70.3 Cases of wine sold over the last 200 days.50/day. we have. Cases of wine 10 11 12 sold per day Days this amount 30 40 80 of wine is sold 13 Total days 50 200 Table 4. 11. σ 2 (10 * 20% (13 11. “days this amount of wine is sold” are used to calculate the probability of future sales using the relationship.75 cases From this.1 Frequency distribution of the sale of wine. of future profits. the values.00 25. Using equation 4(i) to calculate the mean value.Chapter 4: Probability analysis for discrete data average value. Probability of selling amount x days amount of x sold total days considered in analysis For example.1. μx 10 * 15% 13 * 25% 11 * 20% 12 * 40% Cases sold per day Days this amount of wine is sold Probability of selling this amount (%) 10 30 15 11 40 20 12 80 40 13 Total 50 200 25 100 123 Using equation 4(iii) to calculate the standard deviation we have.

per $1. X.5. This implies that the returns on the (125 (300 158.50) * 35% * (10 89.767.034.75) .13 The covariance between the two investments is negative.75) 158.000 invested. X. Table 4.19 89. μx 20% * $100 45% * $300 $158. has a higher expected value than the bond fund.50) * 45% $ 13. 4(v) σxy ∑(x μx)(y μy)P(xy) Here x is a discrete random variable in the first dataset and y is a discrete random variable in the second dataset.75)2 * 20% (125 158. Variance (x y) 2 σ(x y) Probability of economic change (%) High growth fund (X ) Bond fund (Y ) 20 $100 $250 35 $125 $100 45 $300 $10 μy 20% * $250 45% * $10 $89.75 35% * $125 The high growth fund. Contracting Stable Expanding Covariance of random variables Covariance is an application of the distribution of random variables and is useful to analyse the risk associated with financial investments.75) * (250 89.50)2 σ2 y σ2 x σ2 y 2σxy 4(vii) (250 * 35% $8.124 Statistics for Business These calculations give a plausible approach of estimating average long-term future activity on the condition that the past is representative of the future. Using equation 4(i) to calculate the mean or expected values.50 35% * $100 Using equation 4(ii) to calculate the variance.50) * 20% * (100 89. E(x y) E(x) E(y) μx μy 4(vi) The variance of the sum of two random variables is.034. which can be used to analyse portfolio risk. Assume that you are considering investing in two types of investments. Y.89 89. However. and the other is essentially a bond fund. according to expectations of the future outlook of the macro economy is given in Table 4. One is a high growth fund. we have σ2 x ( 100 158. the standard deviation of the high growth fund is higher and this is an indicator that the investment risk is greater. σxy ( 100 158.50)2 * 45% 89. If we consider two datasets then the covariance. The expected value of the sum of two random variables is.50)2 * 20% (100 (10 89.75)2 * 45% $22.5 Economic change Covariance and portfolio risk. Using equation 4(v) to calculate the covariance. σx σy 22.767. σxy. An estimate of future returns. we have.75 Using equation 4(iii) to calculate the standard deviation. The terms μx and μy are the mean or expected values of the corresponding datasets and P(xy) is the probability of each occurrence.75)2 * 35% (300 158. Y.75 150.64 The standard deviation is the square root of the variance or Standard deviation (x y) 2 σ(x y) 4(viii) Covariance and portfolio risk An extension of random variables is covariance. between two discrete random variables x and y in each of the datasets is.9 8.483.

D. Although there is a higher expected return when the weighting in the high growth fund is more.13] 7.000 times where we have a 50% probability of obtaining (1 α)μy 4(ix) The risk associated with a portfolio is given by: 2α(1 α)σxy] 4(x) Assume that we have 40% of our investment in the high-risk fund. the other is decreasing and vice versa.19 0. On bases.000. If α is the assigned weighting to the asset X. quite a lot of the money put into slot machines does flow out as jackpots but about 6% rests with the house. or even tomorrow. 2 σ(x y) $3..60 . Then from equation 4(ix) the portfolio expected return is. not they would go out of business! With probability.767. This law says that the average value obtained in the long term will be close to the expected value. It is the value that is expected to be obtained in the long run.96. this is not the value that will occur next.602 *$8.402 * $22. 20 October 2005.B.2 gives a graph of the expected return according to the associated risk. Figure 4.000 invested there is a risk of $7. or one-armed bandits. or expected value in probability situations. 125 From equation 4(vii) the variance of the sum of the two investments is.75 2 * $ 13. it is the law of averages that governs. The long-term result corresponding to the law of averages can be explained by Figure 4.50 $248.1 Thus if you continue playing. the portfolio has an expected return of $117. From equation 4(vi) the expected value of the sum of the two investments is. you may win a few games.835. This illustrates the tossing of a coin 1.50 $117.75 60% * $89. is.20.75 $89. In fact. then since there are only two assets the situation is binomial and thus the weighting for the other asset is (1 α).40 * 0. Further for every $1. then in the long run you will lose because the gambling casinos have set their machines so that the casino will be the long-term winner. E(P) [α2σ 2 x μp (1 αμx 2 α)2σ y Expected values and the law of averages When we talk about the mean. 034. 1 . This shows that the minimum risk is when there is 40% in the high growth fund and 60% in the bond fund.Chapter 4: Probability analysis for discrete data investments are moving in the opposite direction. In gambling for example. p. which is the weighted outcome based on each of the probability of occurrence. which means there is 60% in the bond fund.767.835. The portfolio expected return for an investment of two assets. μp αμx (1 α)μy 40% * $158.96 Henriques.483. 2 σ(x y) 22. problem gamblers battle the odds. when you play the slot machines.034. This implies that there is less risk with the joint investment than just with an individual investment. 5.20 From equation 4(x) the risk associated with this portfolio is [α2σ2 x (1 α)2σ2 y 2α(1 α) σxy] [0. μx μy $158. If. International Herald Tribune. E(P).483.25 Thus in summary.13 $3.19 8.93 The standard deviation of the two funds is less than standard deviation of the individual funds because there is a negative covariance between the two investments. or when the return on one is increasing.3. or since this amount is based on an investment of $1. *$ 13.69 $61.72%. there is a return of 11. there is a higher risk. In the short term we really do not know what will happen.69 From equation 4(viii) the standard deviation of the sum of the two investments is.75 2 * 0.

000 times.00 0.00 30.100 Number of coin tosses Expected value.00 60.00 20. and risk ($) .000 1.00 10.00 70. 100.00 50. $ Risk.3 Tossing a coin 1. $ Figure 4. 170 160 150 140 130 120 110 100 90 80 70 60 50 40 30 20 10 0 0 10 20 30 40 50 60 70 80 90 100 Proportion in the high risk investment (%) Expected value.00 40.2 Portfolio analysis: expected value and risk.00 90.00 0 100 200 300 400 500 600 700 800 900 1.00 Percentage of heads obtained (%) 80.126 Statistics for Business Figure 4.

Probability.6 gives some various qualitative outcomes using p and q. this is binomial. the binomial-type experiment. The binomial distribution is discrete. The idea of failure here simply means the opposite of what you are testing or expecting. or “success” is p. However. lose face. as we continue tossing the coin. such as a pack of 52 cards. steal. often in the long run the law of averages catches up with them. if the population is finite. q (1 p) Success Failure Win Lose Works Defective Good Bad Present Absent Pass Fail Open Shut Odd Even Yes No . 127 Conditions for a binomial distribution to be valid In order for the binomial distribution to be valid we consider that each observation is selected from an infinite population. After 1. of obtaining an outcome must be fixed over time and that the outcome of any result must be independent of a previous result. then the selection has to be with replacement. In the short term these people may get away with it. as we toss the coin. respectful. This is the norm. and the cumulative number of heads obtained approaches the cumulative number of tails obtained. We first develop a binomial distribution. the probability of obtaining heads or tails Binomial Distribution In statistics. You can perhaps apply the law of averages on a non-quantitative basis to the behaviour in society. and ethical. p Probability. of society’s behaviour. if we say that the probability of obtaining one outcome.Chapter 4: Probability analysis for discrete data heads and a 50% probability of obtaining tails. p. Other criteria for the binomial distribution are that the probability. However. then the probability of obtaining the other. This is a binomial condition. be corrupt. or “failure. Alternatively. There are a few people who might cheat. This illustration supports the Rule 1 of the counting process given in Chapter 2. or one of a very large size usually without replacement. Table 4. In the early throws. are punished or may be removed from society! does not. the law of averages comes into play. which is a table or a graph showing all the possible outcomes of performing many times.” is q. We are educated to be honest. or the average. If in a market survey. The y-axis of the graph is the cumulative frequency of obtaining heads and the x-axis is the number of times the coin is tossed. the cumulative number of heads obtained may be more than the cumulative number of tails as illustrated. a respondent is asked if she likes a product. binomial means there are only two possible outcomes from each trial of an experiment.000 throws we will have approximately 500 heads and 500 tails. For example.6 Qualitative outcomes for a binomial occurrence. Again. Since there are only two possible outcomes. or be violent. The tossing of a coin is binomial since the only possible outcomes are heads or tails. then the alternative response must be that she does not. They get caught. If we know beforehand that a situation exhibits a binomial pattern then we can use the knowledge of statistics to better understand probabilities of occurrence and make suitable decisions. In quality control for the manufacture of light bulbs the principle test is whether the bulb illuminates or Table 4. in the tossing of a coin. The value of q must be equal to (1 p).

again with a probability outcome of 50%.5 * 0. In the throwing a die. Variance σ2 n*p*q 40 * 0.5 The binomial random variable x can have any integer value ranging from 0 to n.5 20 x) 4(xii) The variance of the binomial distribution is the product of the number of trials. p (1 q p) 0. In the drawing of a card from a full pack. if we tossed a coin 40 times then the mean or expected value would be. In the binomial function.128 Statistics for Business remains always at 50% and obtaining a head on one toss has no effect on what face is obtained on subsequent tosses. Probability of x successes. Again. If a card is replaced after the drawing. This is the case in the coin toss experiment. out of n observations are possible. the number of trials undertaken. an odd or even number can be thrown. and the pack shuffled. or selecting a black and red card from a pack. the expression. The expected value of the binomial distribution E(x) or the mean value.00 Standard deviation. σ σ2 (n * p * q) 10 3. We have already presented this expression in the counting process of Chapter 3.16 . or the probability of success. and the characteristic probability of failure.00% 4(xi) an even or odd number on throwing a die. 40 * 0. n number of trials undertaken. σ2 n*p*q 4(xvi) ● ● ● p is the characteristic probability. px * q(n x) 4(xiii) is the probability of obtaining exactly x successes out of n observations in a particular sequence. μx E(x) n*p 4(xv) For example. the characteristic probability of success. is the product of the number of trials and the characteristic probability. then q is 50% and the resulting binomial distribution is symmetrical regardless of the sample size. q (1 p) or the probability of failure. obtaining 10. In these three illustrations we have the following relationship: Probability. The standard deviation of the binomial distribution is the square root of the variance. For each result one throw has no bearing on another throw. if p 50%. When p is not equal to 50% the distribution is skewed. the results of subsequent drawings are not influenced by previous drawings.5 or 50. or the sample size. x number of successes desired. in n trials n! ⋅ p x ⋅ q(n x! (n x)! ● is how many combinations of the x successes. The relationship. n. μx. n! x ! (n x)! 4(xiv) Mathematical expression of the binomial function The relationship in equation 4(xii) for the binomial distribution was developed by experiments carried out by Jacques Bernoulli (1654–1705) a Swiss/French mathematician and as such the binomial distribution is sometimes referred to as a Bernoulli process. the probability of obtaining a black card (spade or clubs) or obtaining a red card (heart or diamond) is again 50%. σ σ2 (n * p * q) 4(xvii) Again for tossing a coin 40 times.

6.00 Probability of obtaining this cumulative value of x (%) 0. Develop a complete binomial distribution for this situation and interpret its meaning.34 27. The histogram corresponding to this data is shown in Figure 4. One of the candidates has to be chosen.78 5. the interview process is binomial – either a particular ● ● ● Probability of having exactly two boys 16.75%.41%.34 93. or 7 boys) 93.Chapter 4: Probability analysis for discrete data 129 Application of the binomial distribution: Having children Assume that Brad and Delphine are newly married and wish to have seven children. 4.50(7 2) 2! (7 2)! p(x p(x 2) 2) 5. We interpret this information as follows: ● Deviations from the binomial validity Many business-related situations may appear to follow a binomial situation meaning that the probability outcome is fixed over time.50 (though not a feasible value) Standard deviation is (n * p * q) (7 * 0. Probability of having more than two boys (3. the random variable can take on the values. and 7. Probability of having less than two boys (0 or 1 boy) 6.34 16.1890 3.75 99. and the result of one outcome has no bearing on another. 1.22 100. [function BINOMDIST] the complete probability distribution for each of the possible outcomes can be obtained. In the genetic makeup of both Brad and Delphine the chance of having a boy and a girl is equally possible and in their family history there is no incidence of twins or other multiple births.41 27. 7! p(x 2) 0. Consider for example. What is the probability of Delphine giving birth to exactly two boys? For this situation. Probability of having at least two boys (2.50 * 0. 3.25 * 0. . 4. 0.7 Probability distribution of giving birth to a boy or a girl.50 boys 2. 6.78 6.00 77. a manager is interviewing in succession 20 candidates for one position in his firm. 1.4.25 * 0.78 100.41% 0 1 2 3 4 5 6 7 Total Mean value is n * p 7 * 0. x 2 and from equation 4(xii).47 16. 6. or 7 boys) 77. Thus.25%.0313 2 * 120 21 * 0. Each candidate represents discrete information where their experience and ability are independent of each other. This is given in Table 4.7 for the individual and cumulative values.66 50.34%.1313 16.47 0. We do not need to go through individual calculations as by using in Excel. 5. n.50) 0.502 0.41 5.00% Probability of obtaining exactly this value of x (%) 0. in practice these two conditions might be violated. 2. 5. However. 5. Sample size (n) Probability (p) Random variable (X) 7 50.00 ● p q 50% x.25 22. ● ● Table 4. 040 0. 4. the sample size is 7 For this particular question. 3.

34 27.47 16.41 27. Thus. or the number of customers waiting in line at the cash checkout . the interviewer gets tired and may be inclined to offer the post to say perhaps one of the last few remaining candidates out of shear desperation! In another situation. As the manager continues the interviewing process he makes a subliminal comparison of competing candidates. and even electronic components wear. Denis Poisson (1781–1840).78 5. is another discrete probability distribution to describe events that occur usually during a given time interval.34 candidate is selected or is not. However. The fact that your car started on Tuesday morning should have no effect on whether it starts on Wednesday and should not have been influenced on the fact that it started on Monday morning. the number of patients arriving at the emergency centre of a hospital in one day. if no candidate has been selected. named after the Frenchman. Further.78 0.4 Probability histogram of giving birth to a boy (or girl). in that if one candidate that is rated positively this results perhaps in a less positive rating of another candidate.41 16. consider you drive your car to work each morning. over time. the evaluation is not entirely independent. Thus. or it does not. 30 Probability of giving birth to exactly this number (%) 28 26 24 22 20 18 16 14 12 10 8 6 4 2 0 0 1 2 3 4 Number of boys (or girls) 5 6 7 0. mechanical. as the day goes on. This is binomial and your expectation is that your car will start every time. Illustrations might be the number of cars arriving at a tollbooth in an hour.130 Statistics for Business Figure 4. electrical. on one day you turn the ignition in your car and it does not start! Poisson Distribution The Poisson distribution.47 5. When you get into the car. or the number of airplanes waiting in a holding pattern to land at a major airport in a given 4-hour period. either it starts.

The standard deviation of the Poisson distribution is given by the square root of the mean number of occurrences or. Application of the Poisson distribution: Coffee shop A small coffee shop on a certain stretch of highway knows that on average nine people per Since the probability of at least 13 customers entering in a given hour is 12. 828. The owner has decided that if there is greater than a 10% chance that there will be at least 13 clients coming into the coffee shop in a given hour. This distribution is shown in Table 4.000123 6. P(13) λ13e 9 13! 2. x is the Poisson random variable. P(x) is the probability of exactly x occurrences.71828. To determine the probability of there being exactly 13 customers coming into the coffee shop in a given hour we can use equation 4(xviii) where in this case x is 13 and λ is 9. Probability of more than 13 customers entering in a given hour (100 92. hour come in for service. The number of occurrences in a given onesecond interval is independent of the time at which that one-second interval occurs during the overall prescribe time period. 329 * 0. the manager will hire another waitress. σ (λ) 4(xix) In applying the Poisson distribution the assumptions are that the mean value can be estimated from past data.61) 7. Further. the probability of obtaining the exact random variable.58%. 227. and Column 3 gives the cumulative values. . e is the base of the natural logarithm. Figure 4. P(x) is.39%. Column 2 gives the probability of obtaining exactly the random number.42% or greater than 10% the manager should decide to hire another waitress. Probability of less than 13 customers entering in a given hour 87. if we divide the time period into seconds then the following applies: ● ● ● ● The probability of exactly one occurrence per second is a small number and is constant for every one-second interval.8. Probability of at least 13 customers entering in a given hour (100 87. ● ● ● Probability of exactly 13 customers entering in a given hour 5. 020.04% Again as for the binomial distribution. The probability of two or more occurrences within a one-second interval is small and can be considered zero.42%. 1. 800 5. The number of occurrences in any one-second interval is independent on the number of occurrences in any other one-second interval.Chapter 4: Probability analysis for discrete data as highlighted in the Box Opener “The shopping mall”. 865. Develop the information to help the manager make a decision. P (x) ● λx e λ x! 4(xviii) ● ● ● λ (lambda the Greek letter l) is the mean number of occurrences.04%. and sometimes there are only a few customers.58) 12. you can simply calculate the distribution using in Excel the [function POISSON]. Sometimes the only waitress on the shop is very busy.5 gives the distribution histogram for Column 2. or 2. 541. This distribution is interpreted as follows: ● 131 Mathematical expression for the Poisson distribution The equation describing the Poisson probability of occurrence.

74 70.60 80.00 4(xx) 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 Total The Poisson random variable. The probability relationship from equation 4(xviii) then becomes.01 0. From past data it is known that on a daily basis on average one Fenwick will not be properly recharged and thus not available for use.57 20. we have the Poisson distribution given in Column 2 and Column 5 of Table 4.58 92. However.99 100.62 2. and p is less than. the number of successes out of n observations cannot be greater than the sample size n. The criteria most often applied to make this approximation is when n is greater.24 1. The same Fenwick’s are used as needed to take products out of storage and transfer them to the loading area.11 0.132 Statistics for Business If this requirement is met then the mean of the binomial distribution.96 99.89 99. is small.47 99.9.28 5.01 0.70 7.00 100. Mean value (λ) Random variable (x) Probability of obtaining exactly (%) 0. or equal to 0. when the distribution is used as an approximation of the binomial distribution.39 45. p.86 9. From equation 4(xx) the probability of observing a large number of successes becomes small and tends to zero very quickly when n is large and p is small. Now if we use the binomial approximation. These 25 Fenwick’s are battery driven and at the end of the day they are plugged into the electric supply for recharging.8 Poisson distribution for the coffee shop.12 5. The following illustrates this approximation. 1.09 0.98 99.50 1. then the characteristic probability Poisson approximated by the binomial relationship When the value of the sample size n is large.18 13.85 97. λ.1313% or about 6%.76 99. we can use the Poisson distribution as a reasonable approximation of the binomial distribution.37 6. which it uses every day for unloading and putting into storage products it receives on pallets from its suppliers.30 87.18 11. x in theory ranges from 0 to .68 32.61 95.12 0. can be substituted for the mean of the Poisson distribution. which is given by the product n * p. Application of the Poisson–binomial approximation: Fenwick’s A distribution centre has a fleet of 25 Fenwick trolleys. What is the probability that on any given day.01 0.07 9.71 13.14 0.00 9 Probability of obtaining this cumulative value of x (%) 0.94 1. From this table the probability of three Fenwick’s being out of service on any given day is 6.03 0.58 0.80 98.50 3.11 11.50 11.89 99.05% or 5%. three of the Fenwick’s are out of service? Using the Poisson relationship equation 4(xviii) and generating the distribution in Excel by using [function POISSON] where lambda is 1. and the characteristic probability of occurrence. or equal to 20.04 3.29 0. P (x) (np)x e x! ( np ) Table 4.06 0.57 58. .

0511 0.0000 0.0000 0.0000 0.0000 0.0000 0.0397 37.71 11.0000 0.0000 0.0000 0.0004 0.50 0.0000 0.11 9.0038 0.0000 0.0000 0.11 1.94 1.1313 1.0000 0.0009 0.0000 0.70 8 7.86 10 9.0000 25 1 4.0000 0.58 0.9962 1. Number of Fenwick’s λ p Random variable X 0 1 2 3 4 5 6 7 8 9 10 11 12 Poisson (%) Exact 36.0000 100.0000 0.00% Binomial (%) Exact 36.0000 0.00 0 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 Number of customers arriving in a given hour Table 4.3741 0.03 0.9 Poisson and binomial distributions for Fenwick’s.0000 0.0000 0.0000 0.01 0.0000 0.29 0.7879 36.0073 0.0000 0.0000 0.04 6 4 3.50 1.2405 0.28 6.09 0.01 0.0000 0.0000 100.Chapter 4: Probability analysis for discrete data 133 Figure 4.0000 0.00 .3940 6.37 3.18 13.07 5.00 Binomial (%) 0.0000 0.0000 0.0000 Random variable 13 14 15 16 17 18 19 20 21 22 23 24 25 Total Poisson (%) 0.24 2 0.0000 0.14 0.06 0.7707 5.0000 0. 14 13.7879 18.5413 18.0334 0.3066 0.5 Poisson probability histogram for the coffee shop.0001 0.0000 0.5328 0.18 12 Frequency of this occurrence (%) 11.

9 we have given the probabilities to four decimal places to be able to compare values that are very close. You can also notice that the probability of observing a large number of “successes” tails off very quickly to zero. or even the day after. If we know that data follows a binomial pattern. and the characteristic failure. in Table 4. these relationships can be useful in estimating long-term profits. the standard deviation is the square root of the variance. The standard deviation is the square root of the product of the sample size. In this case it is for values of x beyond the Number 5. yes or no. Then applying the binomial relationship of equation 4(xii) by generating the distribution using [function BINOMDIST] we have the binomial distribution in Column 3 and Column 6 of Table 4. the probability of three Fenwick’s being out of service is 5. the number of Fenwick’s. and the outcome of an activity must be independent of another. Chapter Summary This chapter has dealt with discrete random variables. costs. the characteristic probability. This is about the same result as using the Poisson relationship. We will never know exactly what will happen tomorrow. or the norm. to approach the expected value. the number of passengers waiting for the Tube. or the number of cars using the motorway is relatively random. The mean or the expected value of the random variable is the weighted outcome of all the possible outcomes. As always. any number is likely to appear. This indicates that on any given day. right or wrong.9. their corresponding distribution. An extension of the random variable is covariance analysis that can be used to estimate portfolio risk. The mean value in a binomial distribution is the product of the sample size and the characteristic probability. and we have the characteristic probability of occurrence. The variance is calculated by the sum. etc. The number of people in a shopping mall. Distribution for discrete random variables When integer or whole number data appear in no special order they are considered discrete random variables. The sample size n is 25.00%. Binomial means that there are only two possible outcomes. then for a given sample size we can determine. When we have the expected value and the dispersion or spread of the data. For the binomial distribution to be valid the characteristic probability must be fixed over time. or budget figures. of the square of the difference between a given random variable and the mean of data multiplied by the probability of occurrence.134 Statistics for Business is 1/25 or 4. the probability . The law of averages in life is underscored by the expected value in random variable analysis. This means that for a given range of values. and the binomial and Poisson distribution.9962% or again about 6%. Binomial distribution The binomial concept was developed by Jacques Bernoulli a Swiss/French mathematician and as such is sometimes referred to as the Bernoulli process. however over time or in the long range we can expect the mean value. Note. works or does not work. for example.

named after the Frenchman. which are considered fixed for the experiment in question. In order to determine the Poisson probabilities you need to know the average number of occurrences. the probability of a process operating in a given time period. When this is known. lambda. Although many activities may at first appear to be binomial in nature. Poisson distribution The Poisson distribution. over time the binomial relationship may be violated. the probability outcomes using either the Poisson or the binomial distributions are very close. or the probability outcome of a certain action. then the Poisson distribution can be approximated using the binomial relationship.Chapter 4: Probability analysis for discrete data 135 of a quantity of products being good. Denis Poisson. the standard deviation of the Poisson function is the square root of the average number of occurrences. In an experiment when the sample size is at least 20. and the characteristic probability is less then 5%. is another discrete distribution often used to describe patterns of data that occur during given time intervals in waiting lines or queuing situations. When these conditions apply. .

. Professor Michel is preparing his annual budget. indicate that from 300 to 315 patients per day are tested. Thus tomorrow’s number of patients is a random variable. Using the data in Table 1. Table 1 Men tested Days this level tested 2 7 10 12 12 14 18 20 24 22 18 16 12 5 4 4 Table 2 Men tested 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 Days this level tested 1 1 1 1 10 16 30 40 40 30 16 10 1 1 1 1 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 Required 1. Past daily records. Use the coefficient of variation (ratio of standard deviation to mean value or σ/μ) to compare the data. If the historical data for the testing is according to Table 2. The total direct and indirect cost for testing each patient is €50 and the clinic is open 250 days per year. HIV virus Situation The Pasteur Institute in Paris has a clinic that tests men for the HIV virus. Thus the random variable is the number of patients per day – a discrete random variable. what effect would this have on your budget? 3. 2. what will be a reasonable estimated cost for this particular operation in this budget year? Assume that the records for the past 200 days are representative of the clinic’s operation.136 Statistics for Business EXERCISE PROBLEMS 1. for the last 200 days. The Director of the clinic. The testing is performed anonymously and the clinic has no way of knowing how many patients will arrive each day to be tested. This data is given in Table 1.

If each car leased generates $22 in profit. and the number of days at which this level of cars are leased during 250 days per year when the leasing agencies are opened. what is a reasonable estimate of annual profit for the coming year for each agency? . Using the data from the Cheyenne agency. Do the shapes of the distributions corroborate the information obtained in Question 3? Which of the data is the most reliable for future analysis. Illustrate the distributions given by the two tables as histograms. He is developing his budgets for the following year and is proposing to use historical data for estimating his profits for the coming year. using the average value from the Cheyenne data. For the previous year he has accumulated data from two of his agencies one in Cheyenne. what is a reasonable estimate of the average number of cars leased per day during the year the analysis was made? 2. and the other in Laramie. This data shown below gives the number of cars leased. United States with 10 outlets in this state. Cheyenne Cars leased 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 Days at this level 2 9 12 14 14 18 24 26 29 27 25 20 15 8 6 1 Cars leased 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 Laramie Days at this level 1 1 2 2 12 20 38 49 50 37 19 13 2 2 1 1 Required 1.Chapter 4: Probability analysis for discrete data 137 4. and why? 2. Rental cars Situation Roland Ryan operates a car leasing business in Wyoming.

and the cost to the city. others involved injury. 3. If the data from Laramie was used. the council was disturbed by the number of road accidents that occurred. This including getting to the scene. and in a few cases. Plot a relative frequency probability for this data for the number of accidents that occurred. on average two members of the police force were dispatched together with three members of the fire service.138 Statistics for Business 3. death to those persons involved. and then writing a report. The estimated cost of the police was £35 per hour per person and £47 per hour per person for the fire service. On average each accident took 3 hours to investigate. how would this change the response to Question 1 for the average number of cars leased per day during the year the analysis was made? 4 If the data from Laramie was used. No. of days occurred 7 35 34 46 6 2 31 33 29 31 47 34 30 Required 1. of accidents (x) 0 1 2 3 4 5 6 7 8 9 10 11 12 No. doing whatever was necessary at the accident scene. . which of the data from Cheyenne or Laramie would be the most reliable? Justify your response visually and quantitatively. For estimating future activity for the leasing agency. These costs and injuries were obviously important but also the council wanted to know what were the costs for the services of the police and fire services. When an accident occurred. Road accidents Situation In a certain city in England. The higher cost for the fire service was because the higher cost of the equipment employed. how would this change the response to Question 2 of a reasonable estimate of annual profit for the coming year for all 10 agencies? 5. Some of these accidents were minor just involving damage to the vehicles involved. The council conducted a survey of the number of accidents that occurred and this is in the table below.

What is the highest frequency of occurrence for not meeting the promised time delivery? 3. Express delivery Situation An express delivery company in a certain country in Europe offers a 48-hour delivery service to all regions of the United States for packages weighing less then 1 kg. What is a reasonable estimate of the average number of packages that are not delivered within the promised time frame? 4. Do you think that there is a large variation for this data? 5. Using this data. What is an estimated cost for the annual services of the police services? 6.500. would it meet the target? 6. If the firm is unable to delivery within this time frame it refunds to the client twice the fixed charge of €42. 2. Plot a relative frequency probability for this data for the number of packages that were not delivered within the 48-hour time period. each month. What is an estimated cost for the annual services of the fire services? 7. The following table gives the number of packages of less than one kilogram. If the firm sets an annual target of not paying to the client more than €4. Month January February March April May June July August September October November December 2003 6 4 5 3 0 1 10 2 2 2 3 11 2004 4 6 2 0 5 6 7 9 10 1 1 3 2005 10 7 3 4 4 5 9 3 3 6 4 8 Required 1.Chapter 4: Probability analysis for discrete data 139 2. What qualitative comments can you make about this data that might in part explain the frequency of occurrence of not meeting the time delivery? . What is an estimated cost for the annual services of the police and fire services? 4. based on the above data.50. What is the standard deviation of the number of packages that are not delivered within the promised time frame? 5. what is a reasonable estimate of the daily number of accidents that occur in this city? 3. What is the standard deviation for this information? 4. which were not delivered within the promised time-frame over the last three years.

he will leave the store open for sales of those bookcases in stock. What are your comments about the answer to Question 3? 6. However.140 Statistics for Business 5. and what is this quantity? 3. What is the highest probability of selling bookcases. Investing Situation Sophie. which is a mixture of United States and European funds backed by the corresponding governments. plus selected technology companies. Economic change Probability of economic change (%) High growth fund. 2005. change ($/$1. The other fund is a bond fund. One is a high growth fund that invests in blue chip stocks of major companies.000 of investment. Bookcases Situation Jack Sprat produces handmade bookcases in Devon. what would be the expected financial situation for Jack? 4. Normally he operates all year-round but this year. because he is unable to get replacement help.00 using the expected value. Jack had 19 finished bookcases in his store/workshop. Sales for the previous 2 years were as follows: Month January February March April May June July August September October November December 2003 17 21 22 21 23 19 22 21 20 16 22 20 2004 18 24 17 21 22 23 22 19 21 18 22 15 Required 1. what is the expected number of bookcases sold per month? 2.000) Contracting 15 50 200 Stable 45 100 50 Expanding 40 250 10 . At the end of July 2005. he decides to close down his workshop in August and make no further bookcases. Using her knowledge of finance and economics Sophie established the following regarding probability and financial returns per $1. Based on the above historical data. United Kingdom. a shrewd investor. If the average sale price of a bookcase were £250.000) Bond fund change ($/$1. wants to analyse her investment in two types of portfolios.

What is the probability that at least four customers. will buy something during a specified hour? 8. and the bond fund. What is the probability that at least one customer. 3. buys something. who say they are browsing. 2. Required Assuming a binomial distribution. is 30%. 5. who say they are browsing. Determine the expected values of the high growth fund. Develop the individual probability distribution histogram for all the possible outcomes. Develop a table showing all the possible exact probabilities of acceptance if 20 candidates apply for this programme. . Required 1. and previous examination scores. and the bond fund. and the United States. Last year she evaluated that the probability of a customer who says they are just browsing. What is the probability that no customers. will buy something during a specified hour? 4.Chapter 4: Probability analysis for discrete data 141 Required 1. What is the expected value of the sum of the two investments? What is the expected value of the portfolio? What is the expected percentage return of the portfolio and what is the risk? 7. The acceptance for the programme follows a Bernoulli process. respond to the following questions 1. motivation. who say they are browsing. will buy something during a specified hour? 3. 2. Japan. There is a strong demand for this programme and selection is based on language ability for the country in question. Suppose that on a particular day this year 15 customers browse in the store each hour. Mexico. What is the probability that no more than four customers. will buy something during a specified hour? 5. Records show that in the 70% of the candidates that apply are accepted. Determine the standard deviation of the high growth fund. Determine the covariance of the two funds. who says they are browsing. Gift store Situation Madame Charban owns a gift shop in La Ciotat. 6. European Business School Situation A European business school has a 1-year exchange programme with international universities in Argentina. China. Australia. 4.

What is the probability that in the batch of eight defective boards. What is the probability that in the batch of eight defective boards. none can be corrected? 3. The distribution of defective boards follows a binomial distribution. Historical data indicates that of the defective boards. which is connected to the local network. The head of Information Systems makes a random sample of 10 inspections. Clocks Situation The Chime Company manufactures circuit boards for use in electric clocks. What is the probability that in the batch of eight defective boards. If 20 candidates apply. at least five can be corrected? 5. 3. If 20 candidates apply. Illustrate on a probability distribution histogram all of the possible individual outcomes of the correction possibilities from a batch of eight defective circuit boards. Develop a table showing all the possible cumulative probabilities of acceptance if 20 candidates apply for this programme. 2. what is the probability that at least 15 candidates will be accepted? 7. what is the probability that no more than 15 candidates will be accepted? 8. what is the probability that exactly 15 candidates will be accepted? 6. Computer printer Situation Based on past operating experience.142 Statistics for Business 2. all the possible exact probabilities of acceptance if 20 candidates apply for this programme. 40% can be corrected by redoing the soldering. What is the probability that in the batch of eight defective boards. what is the probability that fewer than 15 candidates will be accepted? 9. fewer than five can be corrected? 10. on a histogram. If 20 candidates apply. Illustrate. no more than five can be corrected? 6. If 20 candidates apply. What is the probability that in the batch of eight defective boards. what is the probability that exactly 10 candidates will be accepted? 5. Much of the soldering work on the circuit boards is performed by hand and there are a proportion of the boards that during the final testing are found to be defective. is operating 90% of the time. 4. the main printer in a university computer centre. . If 20 candidates apply. exactly five can be corrected? 4. Required 1.

In the random sample of 10 inspections. In the random sample of 10 inspections. 2. 6. Historical data at Betin’s marketing . Develop a probability histogram for this situation. Local merchants in surrounding communities accept this card. Assuming that credit acceptance. there is no annual cost for the card. Develop the probability distribution histogram for all the possible outcomes of the operation of the computer printer. is a Bernoulli process. 4. and samples of 15 applicants are made. Required 1. Customers meeting certain requirements can obtain a credit card called “BNP Wunder”. or rejection. 2. In the random sample of 10 inspections. what is the probability that the computer printer is operating in at least 9 of the inspections? 4. Bank credit Situation A branch of BNP-Paribas has an attractive credit programme. what is the probability that the computer printer is operating in exactly 9 of the inspections? 3. what is the probability that the computer printer is operating in at most 9 of the inspections? 5. In the random sample of 10 inspections. 7.Chapter 4: Probability analysis for discrete data 143 Required 1. 3. In how many inspections can the computer printer be expected to operate? 11. what is the probability that the computer printer is operating in more than 9 of the inspections? 6. what is the probability that the computer printer is operating in fewer than 9 of the inspections? 7. 5. The advantage is that with this card. Biscuits Situation The Betin Biscuit Company every August offers discount coupons in the Rhône-Alps Region. What is the probability that exactly three applicants will be rejected? What is the probability that at least three applicants will be rejected? What is the probability that more than three applicants will be rejected? What is the probability that exactly seven applicants will be rejected? What is the probability that at least seven applicants will be rejected? What is the probability that more than seven applicants will be rejected? 12. In the random sample of 10 inspections. France for the purchase of their products. goods can be purchased at a 2% discount and further. Past data indicates that 35% of all card applicants are rejected because of unsatisfactory credit.

no more than four are ejected from the line? 14. exactly four are ejected from the line? 4.000 bottles. 2.000 bottles. What is the probability that no more than three customers do not use the coupons? 13. What is the probability that more than four customers do not use the coupons for the Betin biscuits? 5. Bottled water Situation A food company processes sparkling water into 1.5 litre PET bottles. What is the probability that for 2. serve themselves with fuel such that payment is automatic.144 Statistics for Business department indicates that 80% of consumers buying their biscuits do not use the coupons. none are ejected from the line? 3. What is the probability that for 2. 0. This is the most usual form of purchase. Here the customers fill their tank and then drive to the exit and pay cash. What is the probability that for 2.000 bottles. attached to a hypermarket. One day eight customers enter into a store to buy biscuits. Required 1. What is the probability that for 2. What is the probability that less than eight customers do not use the coupons? 6. Plot this data as a relative frequency distribution. to one of two attendants at the exit kiosk. The speed of the bottling line is very high and historical data indicates that after filling.000 bottles. What is the probability that exactly six customers do not use the coupons for the Betin biscuits? 3. For 2. The other option is the cash-for-gas utilization area.000 bottles. develop a probability histogram from zero to 15 bottles falling from the line. What is the probability that for 2. Cash for gas Situation A service station. has two options for gasoline or diesel purchases. less than four are ejected from the line? 6. This form of distribution is more costly to the operator principally because of the salaries of the attendants in the kiosk. The owner of this service station wants some assurance that .15% of the bottles are ejected. What is the probability that exactly seven customers do not use the coupons? 4. Required 1. 2.000 bottles. This filling and ejection operation is considered to follow a Poisson distribution. Customers either using a credit card that they insert into the pump. at least four are ejected from the line? 5. Develop an individual binomial distribution for the data.

and the probability values that you have determined? 16. what is the probability that more than three cashiers do not show up for work? 5. Using the binomial distribution. based on the criteria given? 3. what is the probability that less than three cashiers do not show up for work? 8. what is the probability that less than three cashiers do not show up for work? 4. Using the binomial distribution. Using the Poisson distribution. what is the probability that more than three cashiers do not show up for work? 9. Using the Poisson distribution. Using the Poisson distribution. what is the probability that on any given day exactly three cashiers do not show up for work? 3. 365 days . Required 1. The Poisson relationship will be used for evaluation. Cashiers Situation A supermarket store has 30 cashiers full time for its operation. These pumps are installed to operate continuously.Chapter 4: Probability analysis for discrete data 145 there is a probability of greater than 90% that 12 or more customers in any hour use the automatic pump. Plot this data as a relative frequency distribution? 2. From past data. Past data indicates that on average 15 customers per hour use the automatic pump. 2. what is the probability that on any given day exactly three cashiers do not show up for work? 7. the absenteeism due to illness is 4. From the information obtained in Question 2 what might you propose for the owner of the service station? 15. Case: Oil well Situation In an oil well area of Texas are three automatic pumping units that bring the crude oil from the ground. 6. Develop an individual Poisson distribution for the data. Using the binomial distribution.5%. Required 1. Should the service station owner be satisfied with the cash-for-gas utilization. Develop a Poisson distribution for the cash-for-gas utilization area. Plot this data as a relative frequency distribution. 24 hours per day. Develop an individual binomial distribution for the data. What are your comments about the two frequency distribution that you have developed.

for each day of a 365-day year. Required Describe this situation in probability and financial terms. Each pump delivers 156 barrels per day of oil when operating normally and the oil is sold at a current price of $42 per barrel. When this occurs. In the table. Pump No. the automatic controller at the pump wellhead sends an alarm to a maintenance centre. Here there is always a crew on-call 24 hours a day. “1” indicates the pump is operating. 3 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 0 1 1 0 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 0 1 0 1 1 1 1 1 1 1 0 1 1 1 1 1 1 0 1 1 1 1 1 0 1 0 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 0 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 . 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 0 1 0 1 1 1 1 0 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 Pump No.146 Statistics for Business per year. per crewmember. There are times when the pumps stop because of blockages in the feed pipes and the severe weather conditions. The data below gives the operating performance of these three pumps in a particular year. When a maintenance crew is called in there is always a three-person team and they bill the oil company for a fixed 10-hour day at a rate of $62 per hour. 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 0 1 1 1 1 1 0 1 1 1 1 1 1 0 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 Pump No. “0” indicates the pump is down (not operating).

2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 0 1 1 1 1 1 1 1 1 Pump No.Chapter 4: Probability analysis for discrete data 147 Pump No. 3 1 0 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 . 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 0 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 Pump No.

This page intentionally left blank .

0000 cl. These measurement anomalies can be explained by the normal distribution.0000 g of chocolate. or the weight of the bar of chocolate. Again. Some values will be higher. should not be consistently high since over time this would cost the producing firm too much money. right? You are almost certainly wrong as this implies a volume of 33. . net weight 100 g. When you buy a bar of black chocolate it is stamped on the label. you have exactly a volume of 33 cl in the can. just because of the variation of the filling process for the cans of beer or the moulding operation for the chocolate bars. The volume of the beer in the can. or the machine setting is to obtain a certain value it is just about impossible to always obtain this value. it is highly unlikely that you have 100. the volume or weight cannot be always too low as the firm will not be respecting the information given on the label and clearly this would be unethical. where the target. Conversely. In operations.Probability analysis in the normal distribution 5 Your can of beer or your bar of chocolate When you buy a can of beer written 33 cl on the label. and some will be lower.

are not whole numbers. It is valuable to understand the characteristics of the normal distribution as this can provide information about probability outcomes in the business environment and can be a vital aid in decision-making. or humped shaped and it is symmetrical around this hump. It is widely used in statistical analysis. and time there is no distinct cut-off point between the data values and they can overlap into other class ranges. As we have illustrated in the box opener “Your can of beer or your bar of chocolate”. It is a continuous distribution. and the y-axis is the frequency of occurrence of this random variable. As we mentioned in Chapter 3. We may note that the runner completed the Santa Barbara marathon in 3 hours and 4 minutes and 32 seconds. or that amount indicated on the label. The concept was developed by the German. mound. The nominal weight of a bar of chocolate is 100 g but the actual weight when measured may be in fact 99.8579. The following are the basic characteristics of the distribution: ● ● Describing the Normal Distribution A normal distribution is the most important probability distribution. the normal distribution. It is bell.1. The theory and concepts of this distribution are presented as follows: ✔ ✔ ✔ Describing the normal distribution • Characteristics • Mathematical expression • Empirical rule for the normal distribution • Effect of different means and/or different standard deviations • Kurtosis in frequency distributions • Transformation of a normal distribution • The standard normal distribution • Determining the value of z and the Excel function • Application of the normal distribution: Light bulbs Demonstrating that data follow a normal distribution • Verification of normality • Asymmetrical data • Testing symmetry and asymmetry by a normal probability plot • Percentiles and the number of standard deviations Using normal distribution to approximate a binomial distribution • Conditions for approximating the binomial distribution • Application of the normal–binomial approximation: Ceramic plates • Continuity correction factor • Sample size to approximate the normal distribution The normal distribution is developed from continuous random variables that unlike discrete random variables. is 33 cl. the nominal volume of beer in a can. then the normal distribution can be used as a measure of probability. weight. the actual volume when measured may be in fact 32. if the frequency of occurrence can represent future outcomes. or frequency of occurrence. For all these values of volume. Karl Friedrich Gauss (1777–1855) and is thus it is also known as the Gaussian distribution.7458 g.150 Statistics for Business Learning objectives After you have studied this chapter you will understand and be able to apply the most widely used tool in statistics. to describe a continuous random variable. but take fractional or decimal values. When it is . The x-axis is the value of the random variable. Characteristics The shape of the normal distribution is illustrated in Figure 5. However.

m 3s ● ● ● symmetrical it means that the left side is a mirror image of the right side. In addition. is given by the normal distribution density function. or the two tails of the normal distribution. may extend far from the central point implying that the associated random variable. x .14159. The central point. However.Chapter 5: Probability analysis in the normal distribution 151 Figure 5. Frequency or probability 3s Mean value. The inter-quartile range is equal to 1. the normal distribution is still a reasonable approximation. and midrange. mode. π is the constant pie equal to 3. σx is the standard deviation. Mathematical expression The mathematical expression for the normal distribution.71828. median. . The left and right extremities. They all have the same value.1 Shape of the normal distribution. x is the value of the random variable.33 standard deviations. f (x) 1 2πσx e (1/2)[(x μx )/ σx ]2 5(i) ● ● Regarding the tails of the distributions most real-life situations do not extend indefinitely in both directions. is at the same time the mean. has a range. e is the base of the natural logarithm equal to 2. μx is the mean value of the distribution. or the hump of the distribution. x. negative values or extremely high positive values would not be possible. and from which the continuous curve is developed. for these situations ● ● ● ● f(x) is the probability density function.

Kurtosis value is 0. s 2.2 Normal distribution: the same mean but different standard deviations. This means that the boundary limits of this almost 100% of the data are μ 3σ. However their means Figure 5.0. ● The same mean. and 10. here 2.0. the flatter is the curve and the deviation around the mean is greater. This means that the area under the curve represents all or 100% of the data. but different standard deviations as illustrated in Figure 5.50. Almost 100% (the exact value is 99.73%) of all the data falls within 3 standard deviations from the mean.60 s m 10. This means that the boundary limits of this 68% of the data are μ σ.00. Here the standard deviation is 10. here 10.0. and the standard deviation measures its spread or dispersion. Kurtosis value is 1. Kurtosis value is 5.5. Different means but the same standard deviation as illustrated in Figure 5. About 95% (the exact value is 95.3.66 s 5. the area under the curve is always unity.44%) of all the data falls within 2 standard deviations from the mean. The smaller the standard deviation.152 Statistics for Business Empirical rule for the normal distribution There is an empirical rule for the normal distribution that states the following: ● Effect of different means and/or different standard deviations The mean measures the central tendency of the data. The larger the standard deviation.26%) of all the data falls within 1 standard deviations from the mean.2.37 .00 for the three curves and their shape is identical.50. About 68% (the exact value is 68. Datasets in a normal distribution may have the following configurations: ● ● ● ● No matter the values of the mean or the standard deviation. the curve is narrower and the data congregates around the mean. Here there are three distributions with the same mean but with standard deviations of 2. This means that the boundary limits of this 95% of the data are μ 2σ. 5.00 respectively.

but different standard deviations.00 and a standard deviation of 10. The curve that has a small standard deviation. the shape of the normal distribution is determined by its standard deviation.50. Here the kurtosis value is 1.2.66. The sharper curve has a mean of 20. Here the flatter curve has a mean of 10. The middle curve has a mean of 0 and a standard deviation of 5. Different means and also different standard deviations are illustrated in Figure 5.2. the different standard deviations alter the sharpness or hump of the peak of the curve as illustrated by the three normal distributions given in Figure 5. and the mean value establishes its position on the x-axis. and 20 so that they have different positions on the x-axis. The curve that has a standard deviation.0 is platykurtic after the Greek word platy meaning broad.00 and a standard deviation of 2.Chapter 5: Probability analysis in the normal distribution 153 ● are 10. As such. In conclusion. and as shown in Figure 5.5 is leptokurtic after the Greek word lepto meaning slender.37.00. Figure 5.2. or flat. or the characteristic of the peak of a frequency distribution curve.3 Normal distribution: the same standard deviation but different means. However. a set of data can be uniquely defined by its mean and standard deviation.00. σ 2. and this flatness can be seen also in Figure 5. This difference in shape is the kurtosis. 0. the kurtosis value is 5. Kurtosis in frequency distributions Since continuous distributions may have the same mean.4. there is an infinite combination of curves according to their respective mean and standard deviation. s 10: m 20 s 10: m 10 s 10: m 0 60 50 40 30 20 10 0 10 20 30 40 50 60 . σ 10. The peak is sharp.

However. the volume of beer in cans. the weight of chocolate bars. Here the kurtosis value is 0. Since the numerator and the denominator (top and bottom parts of the equation) have the . tyre. There are centilitres for the beer.154 Statistics for Business Figure 5. The kurtosis value of a relatively flat peak is negative. In statistics. Meso from the Greek means intermediate.60.5: m 20 s 5: m 0 s 10: m 10 60 50 40 30 20 10 0 10 20 30 40 50 60 The intermediate curve where the standard deviation σ 5. grams for the chocolate. The kurtosis value can be determined in Excel by using [function KURT]. all these datasets can be transformed into a standard normal distribution using the following normal distribution transformation relationship: z x μx σx 5(ii) ● ● Transformation of a normal distribution Continuous datasets might be for example. or kilometres for the tyres.0 is called mesokurtic since the peak of the curve is in between the two others. recording the kurtosis value of data gives a measure of the sharpness of the peak and as a corollary a measure of its dispersion. σx is the standard deviation of the distribution. μx is the mean of the distribution of the random variables. z is the number of standard deviations from x to the mean of this distribution.4 Normal distribution: different means and different standard deviations. s 2. In the normal distribution the units for these measurements for the mean and the standard deviation are different. or the distance travelled by an automobile ● ● x is the value of the random variable. whereas for a sharp peak it is positive and becomes increasingly so with the sharpness. The importance of knowing these shapes is that a curve that is leptokurtic is more reliable for analytical purposes.

then z can be either plus or minus. The tyre lasts 37.00 the area under the curve represents 68.4 100. When z is negative it means that the area under the curve from the left is less than 50% and when z is positive it means that the area from the . we have.64%.50 1.23%. Assume one slab of chocolate is taken at random from the production line and its weight is 100.75 0.45%. since the value of x can be more.50 Thus. then the area under the curve represents 95. or less. the mean value of a certain size chocolate bar is 100 g and from past data we know that the standard deviation of a production lot of these chocolate bars is 0. When the values of z range from 2. of zero. there are no units for the value of z. than the mean value.5. Reorganizing equation 5(ii) to make x the subject. or 2.80 Thus in each case we have the same number of standard deviations. and 99.4 98. which give areas under the curve from the left-hand tail to the z-value of 32. The area under the curve to the left of the mean is 50. when z is 3 the value of x from equation 5(iii) is.60 0.60 100. for a certain format the mean value of beer in a can is 33 cl and from past data we know that the standard deviation of the bottling process is 0. when z is 2 the value of x from equation 5(iii) is.27% or about 68% of the data. For values of z ranging from 3. for values of z ranging from 1.00 to 3.Chapter 5: Probability analysis in the normal distribution same units. for the case of a bar of chocolate of a nominal weight of 100.00 g. as presented earlier.78. In this case using equation 5(ii).00 0. 0. This is as opposed to the value of the standard deviation. Assume that a single can of beer is taken at random from the bottling line and its volume is 33.80 Alternatively. σ. of the data. or close to 95%.75 cl.40 1.45. in using three different situations each with different units. x 100.00 to 2.00. 500 2. 000 1. 500 1.00 the area under the curve represents 99. x μx zσx 5(iii) Alternatively. 78. z x μx σx 100. z.73% or almost 100% of the data.06%.40 g. x. 250 1. respectively. The other values of x are calculated in a similar manner.50 Again assume that the mean value of the life of a certain model tyre is 35.40 g. This is how the normal frequency distribution can be used to estimate the probability of occurrence of certain situations. For example. Note the value of z is not necessarily a whole number but it can take on any numerical value such as 0.50 0.250 km.35. 250 35.00 ( 3) * 0. z x μx σx 37.000 km and from past data we know that the standard deviation of the life of a tyre is 1. We have converted the data to a standard normal distribution. Then suppose that one tyre is taken at random from the production line and tested on a rolling machine. and a population standard deviation of 0.00 to 1. μ.60 g. z x μx σx 33.00%.50 155 The standard normal distribution A standard normal distribution has a mean value.50 cl. Then using equation 5(ii).75 33.500 km. These relationships are illustrated in Figure 5. These areas of the curve are indicated with the appropriate values of z on the x-axis.00% and the area to the right of the mean is also 50. In this case using equation 5(ii). indicated on the x-axis are the values of the random variable. These values of x are determined as follows.00 0.00 2 * 0. Further. And. x 100. Also.40 0.

45% 68. or probability P(x). speed. These area values can also be interpreted as probabilities.80 2 99. μ. ● ● ● [function NORMDIST] determines the area under the curve. this book uses the Microsoft Excel function for the normal distribution.80 3 101. σ.73% 95. σ. given the area under the curve or the probability. These tables give the area of the curve either to the right or the left side of the mean and from these tables probabilities can be estimated. of the dataset. Thus for any data of any continuous units such as weight. μ. [function NORMSINV] the value of z given the area or probability. . the mean value. will contain the same proportion of the total area under the curve for any normal probability distribution. etc.156 Statistics for Business Figure 5. The logic of the z-values in Excel is that the area of the curve increases from 0% at the left to 100% as we move to the right of the curve. and the standard deviation. x.4 z x 3 98. [function NORMSDIST] the value of the area or probability.60 0 100. length. Instead of tables. 99. [function NORMINV] determines the value of the random variable.5 Areas under a standard normal distribution. given z.20 left of the curve is greater than 50%. volume.27% Frequency.40 2 100. z from the mean. given the value of the random variable x.00 1 100.20 1 99. which has a complete database of the z-values. the mean value. P(x). The following four useful normal distribution functions are found in Excel. or probability of occurrence Standard deviation is s 0. P(x). ● Determining the value of z and the Excel function Many books on statistics and quantitative methods publish standard tables for determining z. p. and the standard deviation. all intervals containing the same number of standard deviations.

μ.95% Life of a light bulb. is 84.05%. What is the probability that a light bulb of this kind selected at random from the production line will last no more than 3. when they are selected.250 2.250 hours. z 3. is also a constant with a value of 725 hours. 2. What is the probability that a light bulb of this kind selected at random from the production line will last at least 3.500 725 750 725 1. Thus. 1. Frequency of occurrence Area 84. Thus we can say that the probability of a single light bulb taken from the production line has an 84.500 hours and the standard deviation. is considered a constant at 2. hours 2. The standard deviation of this data is 725 hours and the illumination time of a light bulb is considered to follow a normal distribution. This area is (100% 84. Thus we can say that there is a 15. is 3.250.05% . as for all the Excel functions.250 hours. the mean value.6.250 It is not necessary to learn by heart which function to use because.250 hours. From [function NORMSDIST] the area under the curve from left to right. for z 1.0345.0345 Application of the normal distribution: Light bulbs General Electric Company has past data concerning the life of a particular 100-Watt light bulb that shows that on average it will last 2. This concept is shown on the normal distribution in Figure 5.6 Probability that the life of a light bulb lasts no more than 3. knowing the information that you have available. Thus for this situation.95%.95% probability of lasting not more than 3. where the random variable. x. The application of the normal distribution using the Excel normal distribution function is illustrated in the following example. tells you what normal function to use.500 3. σ.95%) or 15. it indicates what values to insert to obtain the result.250 hours? Here we are interested in the area of the curve on the right where x is at least 3.Chapter 5: Probability analysis in the normal distribution 157 Figure 5.250 hours? Using equation 5(ii).500 hours before it fails.

500 hours is (50.250 is greater than the mean.05%) 60.05%) 34.52% probability that a single light bulb taken from the production line will last no more than 2. We can determine this probability by several methods.05% Life of a light bulb.000 hours? Using equation 5(ii).000 hour is less than 2.000 hours.48%. Method 1 ● ● Area of the curve 2.00% and also the area of the curve to the right of the mean is 50.000 2.00%.000 and 3.00% 24.52%.95%.500 3.250 hours where 2.158 Statistics for Business Figure 5.6897 In this case we are interested in the area of the curve between 2. This is shown on the normal distribution in Figure 5. Area of the curve between 2. Thus. Method 2 Since the normal distribution is symmetrical.6897 is 24. Area of the curve 3.000 hours.250 hours and above is 15. What is the probability that a light bulb of this kind selected at random will last no more than 2.250 hours.52% from answer to Question 3. 3.05% probability of lasting at least 3. which it does since 2.00% 15. From [function NORMSDIST] the area of the curve for z 0.7 Probability that the life of a light bulb lasts at least 3.250 probability that a single light bulb taken from the production line has a 15. What is the probability that a light bulb of this kind selected at random will last between 2.8.500 hours. we can say that there is a 24. area between 2.250 hours? Thus.000 hours is to the left of the mean and 3.000 and 2.52% 15. is now 2. ● ● Area of the curve between 2. The fact that z has a negative value implies that the random variable lies to the left of the mean.000 hours and 3.00% 24. the area of the curve to the left of the mean is 50.7.250 hours is (100. where the random variable.250 hours. Thus. This is shown on the normal distribution curve in Figure 5.000 and 3. 4.000 hours and below is 24. . hours 2. x.52%) 25.500 and 3.250 hours is (50. Frequency of occurrence Area 15.500 725 500 725 0. z 2.05% from answer to Question 2.43%.

250 hours and below is 84.250 hours.500 Life of a light bulb.000 hours and below is 24.52%) 60.43%.52%.250 hours is (25. Area of the curve at 3. Frequency of occurrence Area 24.000 and 3.000 and 3.9 Probability that the light bulb lasts between 2.95% 24. Area Frequency of occurrence 60.9. hours Thus.48% 34.000 and 3.52% 2.43%. hours Figure 5.95%. This situation is shown on the normal distribution curve in Figure 5.8 Probability that the life of a light bulb lasts no more than 2.000 hours. area of the curve between 2.95%) 60.Chapter 5: Probability analysis in the normal distribution 159 Figure 5. Thus.250 hours is (84.000 2.500 3.000 2. area of the curve between 2. . Method 3 ● ● Area of the curve at 2.250 Life of a light bulb.43% 2.

by the area under the curve determined by the answer to Question No.500 3.50% as illustrated in Figure 5.50%. 3. Since the normal distribution is symmetrical.00% Area 12.1503.50% Area 12.50% 1. 6.00%) 25.500 1.000 and 3.666 2. how many bulbs would be expected to fail at 3. μx 2.000.95% 42.000 by the area under the curve by the answer determined in Question No. hours . the area on the right of the limit. x (upper limit) 2. at which 75% of the light bulbs will last? In this case we are interested in 75% of the middle area of the curve. given the value of 12.00%. or 50. From the normal probability functions in Excel.000 * 84.10.160 Statistics for Business 5. 4. or the left tail. Area Frequency of occurrence 75. From equation 5(iii) where z at the upper limit is 1. or 50. is also 12.666 hours These values are also shown on the normal distribution curve in Figure 5.477 light bulbs rounded to the nearest whole number.500 and σx is 725.1503 * 725 At the lower limit z is x (lower limit) 1. is 25/2 or 12. 7.477.216.00% 75. we multiply the population N. Again.000 * 60.96 or 30.1503 and on the right side. The area of the curve outside this value is (100.334 hours Figure 5.1503.43% 30.50%. the area on the left side of the limit. it is 1.000 of this particular light bulb in stock. or 50. since the curve is symmetrical the value of z on the left side is 1. What are the lower and upper limits in hours.250 hours? Again. 1.1503.250 hours or less? In this case we simply multiply the population N.000 of this particular light bulb in stock. If General Electric has 50. or 50.10 Symmetrical limits between which 75% of the light bulbs will last.334 Life of a light bulb. how many bulbs would be expected to fail between 2. then the numerical value of z is 1.10.217 light bulbs rounded to the nearest whole number. or the right tail. If General Electric has 50. Similarly.24 or 42.1503 * 725 2.500 1.1503 1. symmetrically distributed.

The percentage values are calculated by using the equation 5(iii) first to find the limits for a given value of z using the mean and standard deviation of the data. About 100% of the data lies between 3 standard deviations of the mean. since with this value it is easy to position the situation on the normal distribution curve. The inter-quartile range is equal to 1. It is a matter of preference which of the functions to use. ● ● 161 ● ● ● Demonstrating That Data Follow a Normal Distribution A lot of data follows a normal distribution particularly when derived from an operation set to a nominal value. can be developed to see if their profiles look normal. criteria. The weight of a nominal 100-g chocolate bar. will show if the data appears normal. However. For larger datasets a frequency polygon also developed in Chapter 1 or a box-and-whisker plot. This gives the probability directly. Another verification of the normal assumption is to determine the properties of the dataset to see if they correspond to the normal distribution Box-and-whisker plot . As an illustration. the volume of liquid in a nominal 33-cl beverage can. If they do then the following relationships should be close. For small datasets a stem-and-leaf display as presented in Chapter 1. and the value of x are entered. Some of the units examined will have values greater than the nominal figure and some less. The range of the data is equal to six times the standard deviation. standard deviation.1 gives the properties for the 200 pieces of sales data presented in Chapter 1.11 Sales revenue: comparison of the frequency polygon and its box-andwhisker plot.11 shows a frequency polygon and the box-and-whisker plot for the sales revenue data presented in Chapters 1 and 2. or the life of a tyre mentioned earlier follow a normal distribution. Figure 5.33 times the standard deviation. The information in Table 5. Frequency polygon Verification of normality To verify that data reasonably follows a normal distribution you can make a visual comparison. ● The mean is equal to the median value.Chapter 5: Probability analysis in the normal distribution In all these calculations we have determined the appropriate value by first determining the value of z. Then the amount of data between these limits is determined and this Figure 5. About 95% of the data lies between 2 standard deviations of the mean. there may be cases when other data may not follow a normal distribution and so if you apply the normal distribution assumptions erroneous conclusions may be made. A quicker route in Excel is to use the [function NORMDIST] where the mean. introduced in Chapter 2. About 68% of the data lies between 1 standard deviations of the mean. I like to calculate z.

00 35.490 138.296 71.589 105.865 164.157 98.695 105.876 87.985 86.654 100.298 88.598 68.897 134.597 99.825 165.654 149.456 96.459 138.985 47.895 102.562 89.856 83.295 83.975 82.910.654 70.935.659 125.458 112.985 127. 35.592 108.598 134.00% .67 100.894 93.985 122.598 47.128 86.75 43.653 99.479 116.895 102.985 80.081.854 76.957 117.33σ Normal plot 1σ 2σ 3σ Sales data 1σ 2σ 3σ Value 102.265 142.666.957.562 84.985 123.312 119.479 73.489 82.583 150.985 132.654 118.957 91.984 74.486 131.785 74.589 152.859 121.592 111.326 99.489 69.987 58.698 45.975.489 106.896 109.689 101.856 66.128 135.162 Statistics for Business Table 5.856 64.980 120.945 84.695 82.562 37.456 141.564 93.20 123.695 101.189 101.1 Sales revenues: properties compared to normal assumptions.987 59.965 184.985 96.498 107.584 97.489 163.859 67.897 73.654 124.865 152.597 120.17 41.198 110.654 62.564 79.987 86.584 133.568 103.459 136.298 61.632 123.00 30.456 143.987 165.785 108.489 170.569 139.75 79.654 97.654 69.987 77.888.590 89.890 72.958 106.598 153.579.984 90.498 56.259 64.598 80.786 132.598 120.894 75.864 160.50 184.289 130.895 100.378 109.964 56.128 77.487 81.459 96.489 87.598 95.73% Area under curve 64.486 85.985 103.987 87.985 151.432 140.879 126.958 54.298 112.215 107.492 103.698 81.999 90.569 104.298 92.295.578 161.875 137.759 58.754 101.897 68.498 88.796 123.597 85.958 168.895 63.958 89.562 156.598 55.589 136.958 113.987 72.847 124.982 50.598 129.958 63.895 52.258 162.564 113.985 91.598 77.598 128.832 121.459 78.874 Property Mean Median Maximum Minimum Range σ (population) Q3 Q1 Q3 Q1 6σ 1.698 78.695 89.958 102.986 60.689 114.27% 95.651 98.958 71.189 124.298 111.694 106.00% 100.598 78.598 133.598 85.598 66.295 95.569 184.856 110.865 103.859 146.50% 96.592 94.976 100.329.856 144.00 185.975 134.128 122.897 70.298 115.895 81.895 145.587 104.987 60.895 97.378.259 65.975 76.985 104.45% 99.985 92.30 Area under curve 68.00 149.

90.Chapter 5: Probability analysis in the normal distribution is converted to a percentage amount. there are 200 pieces of data between these limits and 200/200 100.427 2 * 30.8416) * 30. the normal assumption seems reasonable. and the properties of the sales data.667 40.678 Using Excel. x 102. The value of z at the point x with the Excel normal distribution function is 0. Figure 5. From the less than ogive.880 ● From the greater than ogive. Frequency of occurrence Area 80.2% greater than the value of $75.5307 3 * 30.880 3 * 30.12.000. and standard deviation values for the sales data using equation 5(iii) we have.50% x (for z x (for z 2) 2) 102. 80.880 2 * 30. ● 163 Using Excel. if we go back to Chapter 1 from the ogives for this sales data we showed that.667 30. there are 129 pieces of data between these limits and so 129/200 64.00% of the revenues are no more than $145. The following gives an example of the calculation.000 determined from the ogive. x (for z x (for z 1) 1) 102.00% of the sales revenues are at least $75.00% This value is only 2.12 Area of the normal distribution containing at least 80% of the data.880 $76.667 10.880 If we assume a normal distribution then at least 80% of the sales revenue will appear in the area of the curve as illustrated in Figure 5.787 133.667 19.667 ( 0.667 164.907 102. there are 192 pieces of data between these limits and 192/200 96. As a proof of this.00% x (for z x (for z 3) 3) 102. Using this and the mean.667 102.547 Thus from the visual displays.00% x m .027 102.880 71. Using Excel.8416.880 30.000.

13 Area of the normal distribution giving upper limit of 90% of the data.000 determined from the ogive.00% m x Similarly. x 102.667 1.880 $142. Here the distribution of the data has its mode. the median is the middle value and lies between the mode and the mean. the hump. Again. at the right end of the x-axis where there is a higher proportion of large values and lower proportion of relatively small values. or the highest frequency of occurrence. This concept of symmetry and asymmetry is illustrated by the following three situations. The median is the middle value and lies between the mode and the mean.2816 * 30. or the highest frequency of occurrence.164 Statistics for Business Figure 5. This is because it is the mean that is the most affected by extreme values and is pulled over to the right. Asymmetrical data In a dataset when the mean and median are significantly different then the probability distribution is not normal but is asymmetrical or skewed. Here . Frequency of occurrence Area 90. This is because it is the mean that is the most affected by extreme values and is pulled back to the left. If the mean value is less than the median. When the mean value of the dataset is greater than the median value then the distribution of the data is positively or right-skewed where the curve tails off to the right.9% less than the value of $145. Here the distribution of the data has its mode. then the data is negatively or left-skewed such that the curve tails off to the left. Using this and the mean. the hump.000 of its worldwide staff are shown by the frequency polygon and its associated box-and-whisker plot in Figure 5. For a certain consulting Firm A. and standard deviation values for the sales data using equation 5(iii) we have. A distribution is skewed because values in the frequency plot are concentrated at either the low (left side) or the high end (right side) of the x-axis.14.13. the monthly salaries of 1. The value of z at the point x with the Excel normal distribution function is 1.243 This value is only 1.2816. at the left end of the x-axis where there is a higher proportion of relatively low values and a lower proportion of high values. if we assume a normal distribution then 90% of the sales revenue will appear in the area of the curve as illustrated in Figure 5.

179 and 500. 500.000 23. Here the frequency polygon and the box-and-whisker plot are right-skewed. mode.179).000 16. and mean ($12.000 9.000 8. The maximum salary is still $21. or the other 50%.500).036 and $15.800 with the frequency at about 19.000 21.000 14. 500.036.752.000 22. or 50%.001 or the mean is 4.2% or essentially the mean.000 24. have a salary between $15. Thus in ascending order.036 and $19. Now.752 and the minimum is $10.000 13.15 is for consulting Firm B.907 or the mean is just 0. of the staff have a monthly salary between $10.Chapter 5: Probability analysis in the normal distribution 165 Figure 5. The maximum salary is $21.000 15.907 and $21.000 14.036 and $12.752 and the minimum $10.08% less than the median. The mean value is now $12. which explains the lower average value.000 12.0%.000 Monthly salary.000 17.000 22. The maximum salary is still $21.000 11. The mean value is $15.000 20.000 18.000 13. or 50% of the staff have a monthly salary between $10. or the other 50%. have a salary between .18% less than the median.000 12.752 or a larger range of smaller values than in the case of the symmetrical distribution.000 19.000 20.000 21.964). $ the data is essentially symmetrically distributed. have a salary between $12. 500.000 16.036. Here the frequency polygon and the box-and-whisker plot are left-skewed.752 and the minimum $10. Figure 5.207 and the median value is $19.000 10.500 with the frequency at about 24.000 9. median ($12.14 Frequency polygon and its box-and-whisker plot for symmetrical data. Figure 5.000 23.964 and the median value is $12.000 24. The mean value is now $18.036. From the graph the mode is about $11. 22% 20% 18% 16% 14% 12% 10% 8% 6% 4% 2% 0% 8. From the graph the mode is about $15. or the other 50%.000 19.000 17. Now. Thus.000 10. and median are approximately the same.179 and $21.000 11. of the staff have a monthly salary between $10.893 and the median value is $15.907 and 500.179 or the mean is 6.001 and 500.000 15.45% greater than the median. we have the mode ($11.16 is for consulting Firm C. or 50%.000 18.

26% 24% 22% 20% 18% 16% 14% 12% 10% 8% 6% 4% 2% 0% 8.000 17.166 Statistics for Business Figure 5.000 20.000 18.000 12.000 13.000 16.000 24. $ Figure 5.000 24.000 15.000 15.000 10.000 9.000 15.000 17.16 Frequency polygon and its box-and-whisker plot for left-skewed data.000 8.000 17.000 Monthly salary.000 20.000 9.000 16.000 14.000 14.000 10.000 21.000 12.000 10.000 12.000 24.000 21.000 11.000 13.000 23.000 18. $ .000 20.000 22.000 8.000 19.000 19.000 16.000 11.000 14.000 24.000 10. 26% 24% 22% 20% 18% 16% 14% 12% 10% 8% 6% 4% 2% 0% 8.000 17.000 9.000 14.000 11.000 13.000 16.000 12.000 22.000 20.000 22.000 19.000 23.000 23.15 Frequency polygon and its box-and-whisker plot for right-skewed data.000 9.000 15.000 21.000 19.000 22.000 Monthly salary.000 18.000 11.000 18.000 23.000 21.000 13.

6745 0.0000 0.17. For example.00 70.00 40.3. in the column “z”.4 standard deviations. Observe the profile of the graph. Note that the value of z has the same numerical values moving from left to right and at the median.30%.00 10.1257 0.00 85.8416 1.00 45.Chapter 5: Probability analysis in the normal distribution $19. z is 0 since this is a standardized normal distribution. median ($19. if there are 19 data points in the array then the curve has 20 portions.00 50.00 80. This procedure is as follows: ● ● ● ● ● Organize the data into an ordered data array.6745 0.752 or a smaller range of upper values compared to the symmetrical distribution. The third The three normal probability plots that show clearly the profiles for the normal. and left-skewed datasets for the consulting data of Figures 5. . For example. which gives z for a given probability. Plot the data values on the y-axis against the z-values on the x-axis.00 30.00 55. “percentile” gives the area to the left of this number of standard deviations.00 20. The next column. which we have demonstrated in the paragraph “Demonstrating that data follow a normal distribution” in this chapter.2816 1.) Determine the number of standard deviations.00 35.5244 0. For each of the data points determine the area under the curve on the assumption that the data follows a normal distribution. we show the number of standard deviations going from 3.500). of standard point data point (%) deviations at data point 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 5.1 and then to position regional sales information according to its percentile value.2816 1. we used percentiles to divide up the raw sales data originally presented in Figure 1. right-.2533 0. Data Area to left of No. for each area using that normal distribution function in Excel.0364 1. (To divide a segment into n portions you need (n 1) limits.6449 Testing symmetry by a normal probability plot Another way to establish the symmetry of data is to construct a normal probability plot. In Table 5.1257 0. we can relate the percentile value and the number of standard deviations.001).207).500 with the frequency at about 24.6449 1.5244 0.2533 0. z. for 19 data values Table 5.3853 0.00 1. Thus in ascending order we have the mean ($18.14–5.001 and $21. which is also the percentile value on the basis the data follows a normal distribution. “Testing symmetry by a normal probability plot”.00 60.00 65.00 75.2 gives the area under the curve and the corresponding value of z. Using the concept from the immediate previous paragraph. and the mode ($20. Percentiles and the number of standard deviations In Chapter 2. If the graph is essentially a straight line with a positive slope then the data follows a normal distribution.00 15. From the graph the mode is about $20. 167 Table 5.00 95.00 90.3853 0.00 25.4 to 3. If the graph is non-linear of a concave format then the data is right-skewed. which explains the higher mean value.8416 0.16 are shown in Figure 5. If the graph has a convex format then the data is left-skewed.2 Symmetry by a normal probability plot.0364 0.

5422 69.9276 99.5070 z 1.7864 2.9032 99.00 2.6800 11.20 1.307 117.241 47.717 35.3200 91.50 1.8198 1. 24.685 157.722 150.1462 72.0000 0.50 0.864 97.60 2.264 184.8134 99.5435 96.882 49.5666 15.00 1.20 Positioning of sales data.8650 99.106 84.000 4.587 76.9663 Value ($) 141.70 1.3467 0.0000 Number of standard deviations Right .30 2.20 2.6210 0.677 39.7250 98.544 35.8036 78.754 66.20 2.304 175.0000 53.053 170.0000 2.581 165.4070 97.000 Salary 16.00 0.90 0.70 2.4799 6.931 .734 78.40 0.2555 0.5339 99.246 152.7911 65.9313 99. Value ($) 35.882 93.0483 0.4060 21.70 2.568 133.3903 1.487 89.485 45.80 0.30 0.728 181.333 56.0000 1.684 184.skewed 2.0724 1.9243 93.40 Percentile (%) 88.0968 0.851 184.8717 3.80 0.10 Percentile (%) Value ($) 13.9517 99.4565 5.30 1.5930 4.0757 9.8655 18.626 112.0740 46.skewed Table 5.591 184.919 184.8145 81.40 1.3 z 3.810 184.30 0.0172 50.2136 98.20 0.000 12.0000 1.2750 2.90 3.778 59.20 0.522 z 1.80 1.987 105.038 163.5747 75.10 2.60 1.80 2.6097 98.976 71.1283 97.616 35.30 2.90 2.502 128.000 14.50 0.390 63.5201 95.9828 57.90 1.293 162.161 135.4253 30.80 2.00 0.1350 0.60 0.70 0.903 184.088 37.4930 90.835 164.30 3.30 3.00 3.40 0.30 1.5940 84.562 61.90 2.10 1.638 37.17 Normal probability plot for salaries.60 1.2089 42.00 1.4578 38.1345 86.0000 Normal distribution Left .70 0.585 42.298 36.90 1.000 22.6807 8.756 184.724 82.296 102.548 47.40 1.40 2.881 184.597 Percentile (%) 0.005 54.70 1.0337 0.535 100.949 87.20 3.40 3.000 20.469 145.80 1.20 3.0000 3.9260 61.0000 3.1855 24.1964 27.697 58.50 2.855 36.1866 0.10 0.986 169.0000 4.60 0.50 2.044 36.4661 0.260 108.090 73.40 2.532 121.1802 99.3193 94.291 138.000 8.50 1.0687 0.10 3.10 2.4334 68.00 2.60 2.3790 99.10 0.7445 99.10 3.168 Statistics for Business Figure 5.072 52.6533 99.682 124. according to z and the percentile.000 18.000 10.487 166.8538 34.

Similarly a value of z 1 puts the sales at 103 31 $72 thousand. and the probability of success. $” is the sales amount corresponding to the number of standard deviations and also the percentile. Using the binomial distribution. σ σ2 (np(1 p)) (npq) 169 When the two normal approximation conditions apply. and fire ceramic plates.666. μx E(x) np And from equation 4(xvii) the standard deviation of the binomial distribution is given by. The quality control manager takes a random sample of 500 of these plates and inspects them. the sample size.1 the standard deviation. glaze. using the standard z-values we have a measure of the dispersion of the data. . “Value.67 (let’s say $103 thousand). the discrete binomial distribution can be approximated by the continuous normal distribution. for this sales data is $30. Using equations 5(iv) and 5(v) np n(1 p) 500 * 0. enabling us to perform sampling experiments for discrete data but using the more convenient normal distribution for analysis. From Table 5. 500 * 0.3 the value is $71 thousand which again is close. Can we use the normal distribution to approximate the normal distribution? The sample size n is 500. p. the mean or expected value of the binomial distribution is. From Chapter 4. That is. the characteristic probability p is 3%. and the cumulative value is 0. This gives the probability of exactly 20 plates being defective of 4. Under certain conditions.3 the value is $135 thousand (rounding).03) 5 Using a Normal Distribution to Approximate a Binomial Distribution In Chapter 4. equation 4(xv). z 1. np n(1 p) 5 5 5(iv) 5(v) 485 and again a value Thus both conditions are satisfied and so we can correctly use the normal distribution as an approximation of the binomial distribution. 1. From Table 5. 2.03 500 * (1 15 or a value 0. This is particularly useful for example in statistical process control (SPC).Chapter 5: Probability analysis in the normal distribution column.97 5 Conditions for approximating the binomial distribution The conditions for approximating the binomial distribution are that the product of the sample size. or a negligible difference. is greater. [function BINOMDIST] where x is 20.20 (let’s say $31 thousand) and the mean 102. It knows from historical data that in the operation 3% of the plates are defective and have to be sold at a marked down price. we presented the binomial distribution. what is the probability that 20 of the plates are defective? Here we use in Excel. What does all these mean? From Table 5. Thus if sales are 1 standard deviations from the mean they would be approximately 103 31 $134 thousand. Thus.888.16%. This is another way of looking at the spread of information. and the probability p is 3%. Application of the normal–binomial approximation: Ceramic plates A firm has a continuous production operation to mould. n is 500. or equal to five and at the same time the product of the sample size and the probability of failure is also greater than or equal to five. n. then from using equation 5(ii) substituting for the mean and standard deviation we have the following normal–binomial approximation: x μx x np x np z 5(vi) σx npq np(1 p) The following illustrates this application.

18 Continuity correction factor.170 Statistics for Business 3.5 15 14. Frequency of occurrence 15 19.5 20.003 15 Continuity correction factor Now.8144 (500 * 0.03 500 * 0.5) on the upper side.8144 1. In the previous ceramic plate example.5 500 * 0.43%.997) Here we use in Excel. and is shown by a line graph.5 x .5 (20 0. Another way to make the normal– binomial approximation is to apply a continuity correction factor so that we encompass the range of the discrete value recognizing that we are superimposing a histogram to a continuous curve. (Note if we had used a cumulative value 1 this would give the area from the left of the normal distribution curve to the value of x. Using equation 5(vi) for these two values of x gives x1 np(1 np p) 19.5 3.16% obtained in Question 2. This gives the probability of exactly 20 plates being defective of 4. and the cumulative value is 0. σ (npq) 3. if we apply a correction factor of 0.5) and x2 20.5 (20 0. the mean value is 15. whereas the binomial distribution is discrete illustrated by a histogram. [function NORMDIST] where x is 20. the mean value of the binomial distribution is.18. Using the normal–binomial approximation what is the probability that 20 of the plates are defective? From equation 4(xv). the random variable x then on the lower side we have x1 19. This is a value not much different from 4.03(1 0. The concept is illustrated in Figure 5.8144.03) 4.) z1 19. the normal distribution is continuous.55 Figure 5. the standard deviation is 3.5–20.1797 From equation 4(xvii) the standard deviation of the binomial distribution is.003 * 0. μx np 500 * 0.

for values of p from 10% to 90% in order to satisfy both equations 5(iv) and 5(v). This value is again close to those values obtained in the worked example for the ceramic plates.53% 88. n. conversely n(1 p) is small.19 Minimum sample size in a binomial situation to be able to apply the normal distribution assumption.55 Using in Excel.09%.1797 gives the area under the curve from the left to x 19.Chapter 5: Probability analysis in the normal distribution large then for a given value of n the product np is large. If for example p is 99%.5 15 14.5 500 * 0.53%. p.09%). When p is Figure 5.5 3. (1 p) decreases and so for the two conditions to be valid the sample size n has to be larger. [function NORMSDIST] for a z-value of 1.5 of 88.5 gives the area under the curve of 92.03) 5. Figure 5. Sample size to approximate the normal distribution The conditions that equations 5(iv) and 5(v) are met depend on the values of n and p.44% (92.03 500 * 0.03(1 0. 55 50 45 40 Sample size. In this case the probability. then the minimum sample size in order to apply the normal distribution assumption is 500 illustrated as follows: p (1 p) 99% and thus. The difference between these two areas is 4. n(1 500 * 99% p) 500 * 1% 495 5 171 z2 x2 np(1 np p) 20. n (units) 35 30 25 20 15 10 5 0 0% 5% 10% 15% 20% 25% 30% 35% 40% 45% 50% 55% 60% 65% 70% 75% 80% 85% 90% 95% Probability.19 gives the relationship of the minimum values of the sample size.4419 20. As the probability p increases in value. must be equal to 50% as for example in the coin toss experiment. For a value of x of 20. The minimum sample size possible to apply the normal– binomial approximation is 10.8144 1. np 1% and thus. p .

that the data range is about six times the standard deviation. We can develop a stemand-leaf display if the dataset is small. When we know the values of the mean value.33 times the standard deviation. The importance of knowing these shapes is that a curve that is leptokurtic is more reliable for analytical purposes. Data in a normal distribution can be uniquely defined by its mean value and standard deviation and these values define the shape or kurtosis of the distribution. For larger datasets we can draw a box-and-whisker plot. and when the mean is less than the median the distribution is negatively or left-skewed. A more rigorous test of symmetry involves developing a normal . the number of standard deviations from the mean corresponding to the area under the curve. and 99. the filling of yogurt pots. σ. or plot a frequency polygon. mode.44% of the data falls within 2 standard deviations from the mean. Visually. Demonstrating that data follow a normal distribution To verify that data follows a normal distribution there are several tests. or the two tails of the normal distribution. The left and right extremities. of a dataset we can transform the absolute values of the dataset into standard values. may extend far from the central point.73% of data is 3 standard deviations from the mean. or randomness of these operations. or the pouring of liquid chocolate into a mould. and see if these displays are symmetrical. Situations which might follow a normal distribution are those processes that are set to produce products according to a target or a mean value such as a bottle filling operation. These empirical relationships allow the normal distribution to be used to determine probability outcomes of many situations. that the inter-quartile range is equal to 1. median. The central point of the hump is at the same time the mean. μ. In addition. we will find volume or weight values below and above the set target value.172 Statistics for Business Chapter Summary This chapter has been entirely devoted to the normal distribution. Simply because of the nature. No matter the value of the mean or the standard deviation. 95. and midrange. we can determine the properties of the data to see if the mean is about equal to the median. and the random variable. When the mean is greater than the median the distribution is positively or right-skewed. This then gives us a standard normal distribution which has a mean value of 0 and plus or minus values of z.26% of all the data falls within 1 standard deviations from the mean. 68. A distribution between these two extremes is mesokurtic. If the mean and median value in a dataset are significantly different then the data is asymmetric or skewed. Describing the normal distribution The normal distribution is the most widely used analytical tool in statistics and presents graphically the profile of a continuous random variable. Additionally. A distribution that has a large standard deviation relative to its mean has a flat peak and is platykurtic. and that the empirical rules governing the number of standard deviations and the area under the curve are respected. x. the area under the curve of the normal distribution is always unity. a normal distribution is bell or hump shaped and is symmetrical around this hump such that the left side is a mirror image of the right side. the standard deviation. A distribution that has a small standard deviation relative to its mean has a sharp peak or is leptokurtic.

the normal probability plot is related to the data percentiles. If the plot is non-linear and concave then the data is right-skewed. p. n. This condition applies for a minimum sample size of 10 when the probability of success is 50%. This normal–binomial approximation has practicality in sampling experiments such as statistical process control. and if it is convex then the data is left-skewed. then the data is normal. and probability. of success and the product of sample size and probability of failure (1 p) are greater or equal to five then we can use a normal distribution to approximate a binomial distribution. A normal distribution to approximate a binomial distribution When both the product of sample size. Since we have divided data into defined portions.Chapter 5: Probability analysis in the normal distribution 173 probability plot which involves organizing the data into an ordered array and determining the values of z for defined equal portions of the data. For other probability values the sample size will be larger. If the normal probability plot is essentially linear with a positive slope. .

3. a division of Volvo Sweden. What percentage of these calls lasted no more than 180 seconds? What is the probability that a particular call lasted between 180 and 300 seconds? How many calls lasted no more than 180 seconds and at least 300 seconds? What percentage of these calls lasted between 110 and 180 seconds? What is the length of a particular call. Renault trucks Situation Renault Trucks.000 trucks in the analysis. Does Renault Trucks reach this objective? Justify your answer by giving the distance at which at least 75% of the trucks travel. is 150.90% of the trucks are expected to travel? 7. What percentage of trucks can be expected to travel no more than 50. The data is essentially normally distributed.000 km per year? 2.000 km per year? 3.000 km. Required 1. It is interested in the performance of its Magnum trucks that it sells throughout Europe to both large and smaller trucking companies. What is the probability that a randomly selected truck travels between 72. What is the distance below which 99.000. and 200. 4. are expected to travel between 125. For analytical purposes for management.000 and 150. and a standard deviation of 40 seconds. Based on service data throughout the Renault agencies in Europe it knows that on an annual basis the distance travelled by its trucks. 5. Required 1.000 km per year and at least 190.000 km per year? 4. develop a greater than ogive based on the data points developed from Questions 1–6.000 km. is a manufacturer of heavy vehicles. 6. Telephone calls Situation An analysis of 1. In order to satisfy its maintenance and quality objectives Renault Trucks desires that at least 75% of its trucks travel at least 125.174 Statistics for Business EXERCISE PROBLEMS 1. and there were 62. 2. before a major overhaul is necessary. 2. with an average time of 240 seconds.000 and 140.000 km with a standard deviation of 35. such that only 1% of all calls are shorter? . What proportion of trucks can be expected to travel between 82.000 km in the year? 5.000 long distance telephone calls made from a large business office indicates that the length of these calls is normally distributed. How many of the trucks in the analysis.

25 g. What are the upper and lower limits in days within which 80% of the employees will successfully complete the programme? 4. What is the probability that an employee will successfully complete the programme between 40 and 51 days? 2. During the last several months. When the employee passes the examination they are considered competent with the ERP system and they immediately receive a 2% salary increase. What is the probability that an employee will take at least 75 days to complete the training programme? 5. what is the probability that your packet will contain less than the nominal indicated weight of 125 g? 3. 95% will contain at least how much in weight? . What is the probability an employee will successfully complete programme in 35 days or less? 3. Tests at the production site indicate that the average weight in a package is 126. If you buy a packet of these cashew nuts at a store. has been 56 days. If you buy a packet of these cashew nuts at a store. If they fail the examination they are able to retake it as many times as they wish in order to pass. Required 1. This training programme has a fixed lecture period and at the end of the programme there is a self-paced on-line practical examination that the participants have to pass before they are considered competent with the new ERP system. what is the probability that your packet will contain more than 127 g? 2. with a standard deviation of 14 days. Required 1. What is the combined probability that an employee will successfully complete the programme in no more than 34 days or more than 84 days? 4. The human resource department has been instructed to develop a training programme for the employees to fully understand how the new system functions. The time taken to pass the examination is considered to follow a normal distribution. which includes passing the examination. In the packets of cashew nuts. Cashew nuts Situation Salted cashew nuts sold in a store are indicated on the packaging to have a nominal net weight of 125 g.Chapter 5: Probability analysis in the normal distribution 175 3.75 g with a standard deviation of 1. What is the minimum and maximum weight of a packet of cashew nuts in the middle 99% of the cashew nuts? 4. average completion of the programme. Training programme Situation An automobile company has installed an enterprise resource planning (ERP) system to better manage the firm’s supply chain.

to distribution is 3. From the manuscripts she is promised to receive this year for publication. She also knows that from past publishing data a normal distribution represents the distribution time for publication. . approximately how many can Cathy expect to publish within the year? 5. approximately how many can Cathy expect to publish within the first quarter? 2. publication. In a certain year she is told that she will receive 19 manuscripts for publication.000 litre of diesel oil per day. the gas station sells at least 5. on average 5.176 Statistics for Business 5. approximately how many can Cathy expect to publish within the third quarter? 4. Required 1. 10. The assumption is that the sale of diesel oil follows a normal distribution. From the manuscripts she is promised to receive this year for publication. The gasoline station is open 7 days a week and diesel oil deliveries are made once a week on Monday morning. If by the introduction of new technology. Based on passed information she knows that it requires on average. What is the probability that on a given day. From the manuscripts she is promised to receive this year for publication. the gas station sells between 4. the publishing house can reduce the average publishing time and the standard deviation by 30%.5 months to publish a book from receipt of manuscript from the author to getting the book on the market. What is the volume of diesel oil sales at which the sales are 80% more? 5. the gas station sells no more than 4. To what level should diesel oil stocks be replenished if the owner wants to be 95% certain of not running out of diesel oil before the next delivery? Daily demand of diesel oil is considered reasonably steady.850 litre? 3.180 litre? 2. Publishing Situation Cathy Peck is the publishing manager of a large textbook publishing house in England. Gasoline station Situation A gasoline service sells. how many of the 19 manuscripts could be published within the year? 6.24 months.700 and 5. What is the probability that on a given day.200 litre? 4. and that the standard deviation for the total process from review. Required 1. What is the probability that on a given day. The standard deviation of this sale is 105 litre per day. approximately how many can Cathy expect to publish within the first 6 months? 3. From the manuscripts she is promised to receive this year for publication.

The size distribution of the production of ping-pong balls is considered to follow a normal distribution. The filling machines are set to the nominal weight and the standard deviation of the filling operation is 3. What is the diameter of a ping-pong ball above which 75% are greater than this diameter? 6. The jars of marmalade are packed in cases of one dozen jars per case.75 mm. What are the symmetrical limits of the net weight between which 99% of the jars of marmalade lie? 7.25 g. Marmalade Situation The nominal net weight of marmalade indicated on the jars is 340 g. What proportion of cases will be above 4. What are the symmetrical limits of the diameters between which 90% of the pingpong balls would lie? 7.000 ping-pong balls in a production lot how many of them would have a diameter between 368 and 371 mm? 5. What is the combined percentage of ping-pong balls can be expected to have a diameter that is no more than 368 mm or is at least 371 mm? 4. Required 1. If there are 40. What is the net weight of jars of marmalade above which 85% are greater than this net weight? 6. What is the probability that the diameter of a randomly selected ping-pong ball is between 372 and 369 mm? 3. Ping-pong balls Situation In the production of ping-pong balls the mean diameter is 370 mm and their standard deviation is 0.Chapter 5: Probability analysis in the normal distribution 177 7. Required 1. What percentage of jars of marmalade can be expected to have a net weight between 335 and 340 g? 2. What is the combined percentage of jars of marmalade that can be expected to have a net weight that is no more than 333 g and at least 343 g? 4. What percentage of jars of marmalade can be expected to have a net weight between 335 and 343 g? 3. What percentage of ping-pong balls can be expected to have a diameter that is between 369 and 370 mm? 2.000 jars of marmalade in a production lot how many of them would have a net weight between 338 and 345 g? 5. What can you say about the shape of the normal distribution for the production of ping-pong balls? 8.1 kg in net weight? . If there are 25.

56 Variance 1. had the following data regarding the time taken to service clients.24 10. Again. with a standard deviation of 45. A certain restaurant in New York. Activity Showing to table.7344 378. Prices for the yogurt were then determined based on that forecast.200 customers will come to the restaurant in the next month. to the nearest whole number. Yoghurt Situation The Candy Corporation has developed a new yogurt and is considering various prices for the product. could be approximated by a normal distribution. in a 3-month study. Thus.1025 5. What is the probability that a customer can be serviced between 90 and 125 minutes? 3.4225 0. from showing to the table and seating. what is a reasonable estimate of the number of customers that can be serviced between 70 and 140 minutes? 6.86 3. What is the combined probability that a customer can be serviced in 70 minutes or less and at least 140 minutes? 5.350 cartons.7744 Required 1. . What is the probability that a customer can be serviced between 70 and 140 minutes? 4.14 7.54 2. It believed that it was reasonable to assume that the time taken to service a customer.178 Statistics for Business 9. Restaurant service Situation The profitability of a restaurant depends on how many customers can be served and the price paid for a meal. on the basis that 1. a restaurant should service the customers as quickly as possible but at the same time providing them quality service in a relaxed atmosphere.45 82. A later revised estimate from marketing was that average daily sales would be 2.3025 3.200 customers will come to the restaurant. If in the next month it is estimated that 1. and seating client Selecting from menu Waiting for order Eating meal Paying bill Getting coat and leaving Clearing table Average time (minutes) 4. What is the average time and standard deviation to serve a customer such that the restaurant can then receive another client? 2.0625 9.21 14. to clearing the table after the client had been serviced.0625 0. Marketing developed an initial daily sales estimate of 2.400 cartons. 85% of the customers will be serviced in a minimum of how many minutes? 10.

Activity Mean time (minutes) 20 30 15 Standard deviation (minutes) 4 7 3 Dismantling Testing and adjusting Reassembly .76 m. What is the probability of a steel rod from this inventory stock.400? 11. Motors Situation The IBB Company has just received a large order to produce precision electric motors for a French manufacturing company. what proportion of the visitors would have to stoop when going through the door? 13. To fit properly. If after consideration.Chapter 5: Probability analysis in the normal distribution 179 Required 1. Machine repair Situation The following are the three stages involved in the servicing of a machine. meeting the drive shaft specifications? 12. and a standard deviation of 12 cm. what height should the doors be made to the nearest cm? 2.2 0. what is the probability that a day’s sale will be at least 98% of 2.05 cm. what is the probability that a day’s sale will still be over 2. According to the revised estimate. the officials decided to make the door 2 cm higher than the value obtained in Question 1. Statistics indicate that the adult height is normally distributed. Based on the design criterion. and a standard deviation of 0.18 cm.400 given that the standard deviation remains the same? 2. The door opening for the crypt is small and the church officials want to enlarge the opening such that 95% of visitors can pass through without stooping. the drive shaft must have a diameter of 4. Required 1. Doors Situation A historic church site wishes to add a door to the crypt. Required 1. According to the revised estimate.06 cm. with a mean of 1. The production manager indicates that in inventory there is a large quantity of steel rods with a mean diameter of 4.

testing and adjusting.5 9. What is the probability that an allowed time of 75 minutes will be sufficient to complete the servicing of the machine including dismantling. with a standard deviation of 171 days.8 7. averages 17 months.7 13.5 12.3 12.9 9. An analysis of past data indicates that the life of a regular savings account. maintained at its branch. for one particular month. What is the probability that the account will have been closed within 2 years? 3. What is the probability that the reassembly activity alone will take between 13 and 18 minutes? 4.4 9.1 9.3 12.7 9. Savings Situation A financial institution is interested in the life of its regular savings accounts opened at its branch.6 11. is considering purchasing the total 50 retail stores belonging to Hardway. a grocery chain in the Greater London area of the United Kingdom.180 Statistics for Business Required 1.6 10.1 12.8 11. Buyout – Part III Situation Carrefour. For calculation purposes 30 days/month is used. are as follows.5 10.8 10. France. This information is of interest as it can be used as an indicator of funds available for automobile loans. in £ ’000s.3 13. What is the chance an account will be open in 3 years? 15.6 11.7 11. What is the probability that the dismantling time will take more than 28 minutes? 2. What is the probability that the account will still be open in 2.5 years? 4.1 12.5 11.7 12.9 9.7 8.) 8.7 11. Required 1.3 11.1 11.6 10.4 5.1 6. The profits from these 50 stores.2 10.5 14.5 7.6 9. What is the probability that the testing and adjusting activity alone will take less than 27 minutes? 3.5 7.7 10. what is the probability that there will still be money in that account in 20 months? 2.2 8.2 15.9 6.1 11.5 10.6 8. The distribution of this past data was found to be approximately normal. If a depositor opens an account with this savings institution. (This is the same information as provided in Chapters 1 and 2.5 10.3 9.8 8.3 10. and assembly? 14.3 10.5 .5 7.7 10.

at about 80°C. from a production run of 115.56 128. Afterwards these trays move along a conveyer belt during which the chocolate cools and hardens taking the shape of the mould.08 107.94 78.15 96.95 117.29 104.73 104.82 95.50.66 86.97 114. The production cost for these 100 g chocolate bars is £0. These nozzles are set to inject a little over 100 g of chocolate into flat trays which pass underneath the nozzles.33 105.69 127.11 110.08 102.08 73.09 115.33 92. the moulds are turned upside down through a reverse system on the belt after which the belt vibrates slightly such that the chocolate bars are ejected from the mould. A printout of the weights for a sample of 1. What are your conclusions from the answers determined from both methods.19 106.39 104.01 94. Required From the statistical sample data presented.25 102.29 87.) 2.03 89.71 96.47 92.68 99.000/hour. The final part of this production line is where the individual bars of chocolate are packed in cardboard cartons.28 104.55 88. Carrefour management decides that it will purchase only those stores showing profits greater than £12.99 81.90 100. how would you describe this operation? What are your opinions and comments? 109.70 106.000 bars.12 107.Chapter 5: Probability Analysis in the Normal Distribution 181 Required 1.16 106.81 137.96 112.03 91. On the basis that the data follow a normal distribution.47 95.20/unit.83 94.91 94.96 100.22 89.51 110.09 111.28 97.98 108. Case: Cadbury’s chocolate Situation One of the production lines of Cadbury Ltd turns out 100-g bars of milk chocolate at a rate of 20.20 120. How does the answer to Question 1 compare to the answer to Question 6 of buyout in Chapter 1 that you determined from the ogive? 3.06 103.30 105.01 85.93 118.61 97. to a battery of 10 injection nozzles.27 107.48 107.30 109.96 104.65 98.66 109.37 88.000 units is given in the table below.36 111. The next production stage is the packing process where the bars are first wrapped in silver foil then wrapped in waxed paper onto which is printed the product type and the net weight. They are sold in retail for £3. In this cooling process some of the water in the chocolate evaporates in order that the net weight of the chocolate comes down to the target value of 100 g.54 77. Immediately upstream of the start of the packing process. At about the middle of the conveyer line.67 94.38 96.80 114. The start of this production line is a stainless steel feeding pipe that delivers the molten chocolate.18 87.73 110.500.73 116.22 82.08 . calculate how many of the Hardway stores Carrefour would purchase? (You have already calculated the mean and the standard deviation in the Exercise Buyout – Part II in Chapter 2.12 107.76 94.77 94. 16.64 98.18 105.39 81.17 96.78 86.25 125. the bars of chocolate pass over an automatic weighing machine that measures at random the individual weights.33 113.47 100.65 117.

66 106.10 74.11 119.96 84.96 75.75 110.19 91.42 109.75 99.91 74.03 92.53 97.08 105.58 98.79 112.49 90.03 93.92 114.28 109.81 108.12 104.32 104.96 100.82 122.35 106.84 111.18 97.09 106.82 93.72 112.30 122.91 98.23 75.12 82.49 116.53 80.35 104.57 97.13 116.46 111.13 109.41 94.36 104.22 110.23 84.65 101.06 107.56 103.67 92.61 104.53 122.05 117.35 100.85 113.75 93.53 98.63 93.85 118.00 107.86 79.72 91.85 87.23 96.45 84.45 91.02 107.15 100.10 92.77 110.09 116.56 103.70 123.20 113.44 98.47 111.99 110.28 90.54 110.56 98.20 114.59 97.65 84.34 97.78 101.68 81.62 131.94 102.58 99.30 88.00 84.15 85.20 102.96 113.75 102.96 116.70 84.19 88.06 77.53 115.61 90.60 95.38 90.27 88.22 123.78 107.39 98.48 115.61 96.46 94.54 92.96 113.87 96.39 116.72 76.99 115.43 98.73 94.87 79.49 111.31 .16 94.35 93.84 122.32 110.08 96.24 111.94 86.99 115.17 105.15 102.58 76.03 88.64 113.72 89.46 119.13 92.65 101.58 83.93 83.02 98.64 92.71 100.55 103.73 107.80 111.41 112.74 101.54 95.37 86.18 112.91 96.57 113.80 101.37 126.40 84.82 98.47 107.68 107.48 91.56 105.61 114.40 123.58 106.86 110.64 72.99 82.22 94.54 87.92 99.83 109.30 118.82 85.02 86.20 96.03 115.70 100.19 118.97 89.81 120.74 81.89 89.37 115.55 86.84 116.37 106.22 105.15 87.08 125.13 93.85 130.41 97.69 102.15 85.66 94.75 105.22 94.42 97.22 94.22 91.96 104.41 96.74 108.24 78.26 94.51 131.94 102.35 132.16 95.51 102.86 130.71 113.06 90.75 98.72 100.87 100.58 80.23 88.22 76.78 110.59 93.09 122.65 100.58 105.50 104.38 101.56 100.73 92.75 108.25 103.84 101.70 108.46 100.05 96.96 115.36 128.25 101.80 88.29 105.66 93.96 78.36 115.04 123.81 105.78 131.97 104.48 97.12 98.30 105.80 87.21 97.82 106.48 70.27 105.94 79.18 92.69 103.82 103.77 86.35 112.84 111.46 107.99 113.62 107.91 89.56 98.76 115.39 90.01 96.46 91.70 88.81 92.30 111.56 85.16 108.95 96.91 111.13 80.33 108.24 78.76 103.01 100.49 108.15 76.03 69.58 92.47 95.54 118.26 110.59 92.46 78.78 91.09 100.61 112.36 115.23 98.67 87.70 113.74 100.31 105.06 111.94 103.89 104.72 109.43 96.37 106.09 79.09 126.01 86.50 127.88 112.20 94.03 120.76 95.06 100.15 95.08 101.71 101.54 102.92 110.73 109.80 106.28 98.70 93.80 108.182 Statistics for Business 117.53 114.41 102.18 83.85 113.06 92.83 106.47 93.79 88.61 98.19 99.68 104.94 93.75 101.34 71.09 124.46 119.08 90.32 89.88 110.34 114.23 125.04 122.46 77.46 108.58 87.65 99.53 120.79 102.37 112.47 99.74 107.10 79.29 92.66 104.43 83.82 112.74 121.99 91.15 124.76 116.40 93.82 120.54 97.87 100.09 114.01 88.34 102.16 98.96 105.21 117.04 107.22 100.74 94.00 104.72 101.50 106.03 118.33 85.40 121.97 69.80 110.64 112.06 89.66 91.20 96.62 74.88 99.82 81.28 107.58 104.09 97.03 88.20 99.63 100.89 104.86 91.39 86.82 101.72 93.08 118.44 109.89 115.61 109.82 103.60 90.53 111.22 96.53 100.32 85.66 85.33 89.40 110.15 109.22 93.89 90.99 99.65 107.95 84.15 109.22 86.43 114.87 102.94 93.62 94.28 123.72 84.58 87.97 83.77 101.61 68.22 103.46 96.93 117.82 116.35 115.96 108.15 98.01 88.41 107.03 85.66 93.40 86.11 122.62 111.72 100.34 110.17 112.23 92.44 100.26 100.96 100.26 93.22 97.12 76.89 102.71 90.87 100.00 75.21 107.54 91.59 107.20 87.92 133.18 111.08 89.25 91.70 83.64 79.39 91.74 100.14 102.12 107.64 100.48 118.46 105.71 94.39 89.01 108.71 80.15 107.18 107.63 93.81 100.81 97.11 85.65 111.27 80.59 108.85 87.89 98.42 99.78 98.57 84.79 104.41 105.44 99.

00 91.84 104.38 84.08 119.23 108.56 102.45 113.23 89.19 72.94 117.63 88.42 84.37 88.97 113.68 135.87 87.23 94.92 69.42 100.89 99.42 105.72 110.64 86.85 105.19 114.20 90.68 97.70 89.66 78.04 77.23 74.27 85.80 93.72 110.14 109.56 87.01 101.65 78.54 109.77 114.06 107.23 87.18 106.28 91.58 85.06 116.59 97.27 102.56 110.77 104.50 93.92 120.13 109.85 101.91 79.77 94.62 114.86 90.60 124.35 92.23 112.42 107.79 91.89 94.99 97.78 90.58 86.15 96.40 82.89 94.63 80.02 118.83 102.88 101.04 107.40 111.97 96.80 96.75 84.46 86.12 81.58 90.54 116.34 101.78 113.31 94.59 114.06 110.11 105.33 96.22 101.84 108.68 72.16 104.85 92.76 108.68 106.79 112.76 102.52 110.00 84.12 86.13 91.47 74.42 134.26 98.91 91.96 101.53 105.22 91.00 99.15 94.73 111.70 99.91 109.04 87.00 115.43 96.74 102.76 113.91 112.53 87.92 101.25 98.08 100.75 101.28 105.02 113.07 114.28 98.85 87.10 87.47 96.29 96.39 101.68 102.05 99.06 88.44 82.16 92.16 113.83 94.16 89.55 76.10 95.99 91.Chapter 5: Probability analysis in the normal distribution 183 84.58 82.00 129.65 103.67 101.11 89.27 109.91 98.39 93.62 101.19 102.01 105.25 118.68 85.44 124.90 114.95 94.48 105.46 123.28 93.48 78.27 105.12 90.99 95.47 99.31 102.24 84.44 123.32 105.65 104.83 105.13 90.42 113.42 111.70 95.20 94.22 108.57 99.70 74.35 122.96 94.86 87.46 98.53 101.39 113.71 105.42 98.92 92.75 115.70 97.61 108.92 93.56 64.73 90.02 72.80 100.27 104.57 103.61 106.59 117.82 95.42 .27 117.26 93.13 104.61 106.85 88.92 99.97 97.75 99.43 113.67 124.22 94.53 95.08 103.25 103.15 107.91 84.84 90.18 109.52 97.15 112.09 97.32 103.69 105.77 101.87 110.46 91.49 106.35 116.63 102.54 99.11 111.25 124.83 101.30 136.42 109.76 94.94 100.65 104.09 125.41 128.38 100.82 88.52 98.44 103.35 90.08 86.36 99.05 91.90 79.56 99.15 96.71 90.48 85.47 86.57 96.72 88.48 100.29 101.85 98.08 94.96 116.57 110.78 85.53 106.38 104.56 80.88 108.84 84.56 93.32 93.94 96.14 104.49 100.64 109.82 93.83 96.24 105.50 84.15 87.48 80.89 98.15 114.59 87.71 113.85 118.31 101.50 110.39 114.81 112.06 98.99 100.86 96.21 118.33 92.72 93.56 109.73 83.61 101.21 115.27 81.08 93.30 97.66 98.94 98.70 103.61 109.18 99.92 115.13 98.07 91.12 114.27 101.61 96.86 113.45 99.58 86.72 104.25 84.60 100.69 103.23 106.27 110.26 96.91 84.25 90.20 128.68 92.12 107.34 106.82 97.27 92.44 103.15 108.13 94.67 114.66 112.42 87.40 105.48 103.56 97.23 91.37 95.24 112.55 107.28 113.00 104.91 109.52 118.22 78.32 118.95 93.44 108.17 105.72 108.93 103.53 106.08 116.22 92.31 91.74 94.24 113.70 123.44 103.72 95.97 83.58 100.27 111.36 81.40 93.11 110.09 97.01 110.53 93.80 100.15 101.63 82.32 83.53 101.35 110.89 118.56 96.58 119.72 90.46 93.49 99.80 109.72 114.66 96.30 93.91 124.92 95.72 104.63 99.78 114.37 112.06 86.37 94.72 80.18 86.48 101.79 107.91 95.00 112.53 86.97 111.85 112.61 98.44 97.65 84.41 110.59 118.07 111.78 92.

This page intentionally left blank .

2 1 . The Chicago Tribune was “so sure” of the outcome that the headlines in their morning daily paper of 3 November 1948 as illustrated in Figure 6. the Republican candidate.5% of the popular vote to Dewey’s 45% and with an electoral margin of 303 to 189. 371–372. F. The Chicago Tribune had egg on their face. pp. (1982). desires. the Democratic incumbent and Governor Dewey of New York. McGraw Hill. and Brinkley. However. or needs of a population. 3 November 1948. New York. “Dewey defeats Truman”. A. America in the Twentieth Century. A classic illustration of sampling gone wrong was in 1948 during the presidential election campaign when the two candidates were Harry Truman.. 5th edition.1.Theory and methods of statistical sampling 6 The sampling experiment was badly designed! A well-designed sample survey can give pretty accurate predictions of the requirements. In fact Harry Truman won by a narrow but decisive victory of 49. announced. the accuracy of the survey lies in the phrase “well-designed”. something went wrong with the design of their sample experiment!1.2 Chicago Daily Tribune. Freidel.

.1 Harold Truman holding aloft a copy of the November 3rd 1948 morning edition of the Chicago Tribune.186 Statistics for Business Figure 6.

how many months do we date our future spouse before we decide to spend the rest of our life together! Statistical Relationships in Sampling for the Mean The usual purpose of taking and analysing a sample is to make an estimate of the population parameter. based entirely on the analysis of this sample. the production line or the lot from where these samples are taken. we often make decisions based on limited data. As the sample size is smaller than the population we have no guarantee of the population parameter that we are trying to measure. and even in our personal life. but from the sample analysis. application. What we do is take a sample from a population and then make an inference about the population characteristics. when you order a bottle of wine in a restaurant. an important application of statistical analysis. For example. If we really wanted to guarantee our conclusion we would have to analyse the whole population but . And.Chapter 6: Theory and methods of statistical sampling 187 Learning objectives After you have studied this chapter you will understand the theory. The waiter would hardly let you drink the whole bottle before you decide it is no good! The United States Dow Jones Industrial Average consists of just 30 stocks but this sample average is used as a measure of economic changes when in reality there are hundreds of stocks in the United States market where millions of dollars change hands daily. the assumption is that the entire population. We call this inferential statistics. and practical methods of sampling. meet the desired specifications and so all the units can be put onto the market. In manufacturing. samples of people’s voting intentions are made and based on the proportion that prefer a particular candidate. the waiter pours a small quantity in your glass to taste. lots of materials. Sampling for the means for an infinite population • Modifying the normal transformation relationship • Application of sampling from an infinite normal population: Safety valves Sampling for the means from a finite population • Modification of the standard error • Application of sampling from a finite population: Work week Sampling distribution of the proportion • Measuring the sample proportion • Sampling distribution of the proportion • Binomial concept in sampling for the proportion • Application of sampling for proportions: Part-time workers Sampling methods • Bias in sampling • Randomness in your sample experiment • Excel and random sampling • Systematic sampling • Stratified sampling • Several strata of interest • Cluster sampling • Quota sampling • Consumer surveys • Primary and secondary data In business. we draw conclusions. If they do. or finished products are sampled at random to see if pieces conform to appropriate specifications. The topics are broken down according to the following themes: ✔ ✔ ✔ ✔ ✔ Statistical relations in sampling for the mean • Sample size and population • Central limit theory • Sample size and shape of the sampling distribution of the means • Variability and sample size • Sample mean and the standard error. assemblies. the expected outcome of the nation’s election may be presented beforehand. Based on that small quantity of wine you accept or reject the bottle of wine as drinkable. In political elections.

188 Statistics for Business in most cases this is impractical. Figure 6. Thus. 9 cm 6 cm 6 cm 5 cm 4 cm 3 cm 2 cm Here. x . is given by the relationship. An alternative to inferential statistics is descriptive statistics which involves the collection and analysis of the dataset in order to characterize just the sampled dataset. Combinations n! x!(n x)! 3(xvi) Sample size and population A question that arises in our sampling work to infer information about the population is what should be the size of the sample in order to make a reliable conclusion? Clearly the larger the sample size. we sample from the population first with a sample size of one.00 3.00 9.00 21 35 35 21 7 7 . This translates into a mean value of the length of the rods of 5 cm (35/7). Combinations 7! 3!(7 3)! 7! 3! * 4! 35 7 *6 *5* 4 *3* 2*1 3* 2*1* 4 *3* 2*1 If we increase the sample sizes from one to seven rods. To demonstrate the impact of the sample size.2 Seven steel rods and their length in centimetres.1. right through to seven. then from equation 3(xvi) the total possible number of different samples is as given in Table 6. or alternatively. If we take samples of these rods from the population. of the sample would be. For example.00 6. The total length of these seven rods is 35 cm (2 3 4 5 6 6 9). 4. – and 6 cm are selected. Rod number Rod length (cm) 1 2 3 4 5 6 7 Sample size. if we select a sample of size of 3. 2 4 3 6 4. of possible different samples 1 7 2 3 4 5 6 7 2. with replacement. the smaller is the risk of making an inappropriate estimate. then two. the number of possible different combinations from equation 3(xvi) is. The number of the rod and its length in centimetres is indicated in Table 6.00 4.00 6. x No. takes too long. Table 6.2.00 cm Table 6.2 Number of samples from a population of seven steel rods. the same rod not appearing twice in the sample. the greater is the probability of being close to estimating the correct population parameter.00 5. n is the size of the population. etc. then the mean length. as shown in Figure 6. three. then from the counting relations in Chapter 3.2. or is clearly impossible. Each time we select a sample we determine the sample mean value of the length of rods selected.1 Size of seven steel rods. or in this case 7 and x is the size of the sample. too costly. consider an experiment where there is a population of seven steel rods. For example. the possible combinations of rods that can be taken. if the sample size is 3 and rods of length 2.

20 4.67 6 6 9 7.67 2 3 9 4.25 3 4 6 6 4.3.33 2 5 6 4.50 3 5 6 6 5.67 2 6 9 5.00 5.00 3.40 5.00 2 4 5 6 4.00 4.00 4 5 6 6 5.50 105.75 3 6 6 9 6.00 .50 5.00 2 3 6 9 5.80 4.00 5. the sum of all the sample means is then divided by the number of samples withdrawn to give the mean value of the samples or.00 5.75 2 3 6 6 4.25 2 4 6 9 5.33 3 5 6 4.00 5.20 4.50 5.25 2 4 5 6 4.25 5 6 6 9 6.00 4.25 2 3 6 9 5.00 4 5 6 5.00 23456 23456 23459 23466 23469 23469 23566 23569 23569 23669 24566 24569 24569 24669 25669 34566 34569 34569 34669 35669 45669 Size 5 Mean 4.67 2 4 6 4.00 4.50 3 4 6 9 5.00 3 4 6 4.00 4 6 6 9 6.00 35.00 23 24 25 26 26 29 34 35 36 36 39 45 46 46 49 56 56 59 66 69 69 Mean 2.00 7 9 9.67 2 3 6 3.00 6 6 6.67 3 6 6 5.00 5.00 5 6 6. the sum of the sample means is 175 and 189 Table 6.60 5.00 4 5 5.50 4.60 4.25 2 4 5 9 5.00 5.00 5. (Note that there are two rods of length 6 cm. Size Samples of size 1 to 7 taken from a population of size 7.50 6.25 2 5 6 6 4.50 4.50 Size 7 Mean 2 3 4 5 6 6 9 5.) For a particular sample size.3 No.80 6.33 2 3 6 3.00 Mean 1 2 2.60 4.50 7.67 5 6 9 6.00 245669 235669 234669 234569 234569 234566 345669 Size 6 Mean 5.33 2 5 9 5.00 2 4 6 6 4.83 4. =.33 3 4 6 4.33 5.00 7.00 3 6 9 6.00 2 5 6 4.33 3 4 9 5.00 4 5 6 9 6.67 2 4 5 3. for a sample of x size 3.00 Sample = mean x 5.33 5.00 4 5 9 6.00 5.33 4 6 9 6.00 2 3 5 6 4.00 4.50 2 4 6 9 5.75 3 4 5 6 4.00 Size 4 Mean 2 3 4 5 3.00 6.67 3 5 6 4.50 2 3 4 6 3.40 5.50 2 5 6 9 5.20 5.40 5.00 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 Total 35.33 2 6 6 4.00 2 3 5 9 4.00 2 3 5 3.00 4.Chapter 6: Theory and methods of statistical sampling The possible combinations of rod sizes for the seven different samples are given in Table 6.50 3 4 5 9 5.75 3 4 6 9 5.33 4 6 9 6.80 5.17 5.75 3 5 6 9 5.50 175.00 6.00 2 4 9 5.67 5 6 9 6.25 4 5 6 9 6.50 7.00 5.00 3 6 9 6.20 5.50 3 4 5 6 4.00 3 5 6 9 5.00 2 4 6 4.00 5.50 2 3 5 6 4.50 4.67 3 5 9 5. For example.00 5.50 3. 1 Size 2 Size 3 Mean 2 3 4 3.60 5.83 4.50 3.75 2 3 4 6 3.67 2 6 9 5.00 2 3 3.00 4.00 175.00 4 5 6 5.00 105.00 4 6 6 5.75 2 3 4 9 4.00 3 4 4.40 5.50 2 6 6 9 5.33 5 6 6 5.75 2 5 6 9 5.50 5.80 4.67 3 4 5 4.

=.50 4.50 6. the dispersion about the mean value of 5 cm becomes smaller or alternatively more sample means lie closer to the population mean.50 8. also called sampling distribution of the means. for a sample size of four there are four sample means greater than 4.25 7. is a probability distribution of all the possible means of samples taken from a population. a frequency distribution of mean length is determined.3. x .00 6. The distribution of the sample means.000 chocolate bars. or the whole population. the dispersion is zero. The moulding for the chocolate is set such that the weight of each chocolate bar should be 100 g.00 8.00 Total Sample size 1 1 0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 1 7 2 0 0 1 0 1 0 2 0 3 0 3 0 2 0 3 0 2 0 1 0 1 0 2 0 0 0 0 0 0 21 3 0 0 0 0 1 0 1 3 3 0 4 4 4 0 3 4 3 0 2 2 1 0 0 0 0 0 0 0 0 35 4 0 0 0 0 0 0 1 2 2 3 4 3 4 4 4 3 3 1 1 0 0 0 0 0 0 0 0 0 0 35 5 0 0 0 0 0 0 0 0 2 1 1 2 5 3 3 2 2 0 0 0 0 0 0 0 0 0 0 0 0 21 6 0 0 0 0 0 0 0 0 0 0 1 0 3 1 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 7 7 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 Central limit theory The foundation of sampling is based on the central limit theory.25 6.25 3. These values are given at the bottom of Table 6. which is the criterion by which information about a population parameter can be inferred from a sample. This concept of sampling and sampling means is illustrated by the information in Table 6. For example. For the sample size of seven. there becomes a point when the distribution of the sample means.9 where each of the seven histograms have the same scale on the x-axis. for each sample size. Here the production line is producing 500. This data is given in Table 6. Next. can be approximated by the normal distribution. This is so even though the distribution of the population itself may not necessarily be normal. The mean of the sample means. This experiment demonstrates the concept of the central limit theory explained in the following section.00 5. N. This data is now plotted as a frequency histogram in Figures 6. From Figures 6.50 5.3 to 6.50 2. is always equal to the population mean of x 5 or they have the same central tendency. Table 6.25 cm but less than or equal to 4.00 2. Sample = mean x 2.25 8.25 2.25 4.190 Statistics for Business according to the sample size.50 cm.75 8. or exactly the same as the population mean. and this is the population value.75 3.00 7.3 to 6.4 Frequency distribution within sample means for different sample sizes.5 for the production of chocolate. as the size of the sample increases.75 7. This is the nominal weight of the chocolate bar and is this number divided by the sample number of 35 gives 5.4. The left-hand column gives the sample mean and the other columns give the number of occurrences within a class limit . What we conclude is that the sample means are always equal to 5 cm.9 we can see that as the sample size increases from one to seven.25 5.50 7.50 3.75 9. The central limit theory states that in sampling.00 3.75 6.75 5.75 4.00 4.

00 Mean length of rod (cm) . 3 Frequency of this length 2 1 0 2. 6. 4 3 Frequency of this length 2 1 0 2. 00 8. 00 4. 50 7. 00 5. 50 9. 00 Figure 6. 50 0 4. 00 7. 50 00 00 50 5. 5.Chapter 6: Theory and methods of statistical sampling 191 Figure 6. 6. 50 6. 50 8. 50 9. 50 3. 00 2. 00 7. 00 5 2. 0 50 4. 0 3. 50 4. 0 0 3.4 Samples of size 2 taken from a population of size 7. 00 8. 50 8. 00 6. 00 3. 50 5. Mean length of rod (cm) 7.3 Samples of size 1 taken from a population of size 7.

00 8. 00 6. 6. 00 50 4. 50 3. 50 8. 50 0 3. 00 3. 50 9. 50 9. 50 7. 00 7. 00 Figure 6. 00 2. 5 4 Frequency of this length 3 2 1 0 2. 50 6. 50 4. 5 4 Frequency of this length 3 2 1 0 00 2. 00 7. 00 5. 50 4. 0 3. 50 00 00 50 5. Mean length of rod (cm) 7.5 Samples of size 3 taken from a population of size 7.192 Statistics for Business Figure 6. 50 8. 6. 5. 50 5. 00 Mean length of rod (cm) . 2. 00 8. 00 4.6 Samples of size 4 taken from a population of size 7.

50 4. 6. 5. 0 0 4. 50 3. 00 Figure 6. 00 4.Chapter 6: Theory and methods of statistical sampling 193 Figure 6. 00 7. Mean length of rod (cm) 7. 00 3. 00 5 3. 50 00 50 00 6. 0 50 4. 5. 0 3. 50 9. 50 9. 50 00 50 50 00 5. 6. 00 8. Mean length of rod (cm) 7. 2. 50 8. 6. 00 7.8 Samples of size 6 taken from a population of size 7. 4 3 Frequency of this length 2 1 0 00 2. 6 5 Frequency of this length 4 3 2 1 0 00 2. 50 8. 5. 5 2.7 Samples of size 5 taken from a population of size 7. 00 . 00 8.

μ. Each sample contains 15 chocolate bars. the sampling distribution of the means of samples taken at random from the population will be approximately normally distributed if samples of at least a size of 30 units each are withdrawn. and the mean weight of each sample.48 g. For quality control purposes an inspector takes 10 random samples from the production line in order to verify that the weight of the chocolate is according to specifications. x1. the sampling distribution of the means of samples taken at random from the population will be approximately normal if samples of at least 15 units are withdrawn.85 g. 0 0 3.9 Samples of size 7 taken from a population of size 7. 00 5 2. is 99. 1. if we consider sample No. 00 8. 0 5 3. Each bar in the sample is weighed and these individual weights. 0 4. – The mean weight of the 10th sample. regardless of their shape. Sample size and shape of the sampling distribution of the means We might ask. For example. = is 99. 50 8. what is the shape of the sampling distribution of the means? From statistical experiments the following has been demonstrated: ● ● ● For most population distributions. 50 50 00 50 00 6. are recorded. 00 the population mean. 5. and the weight of the 15th bar is 98. is 100. If the population is normally distributed. the weight of the 1st bar is 100.194 Statistics for Business Figure 6. the weight of the 2nd bar is 99. 2 Frequency of this length 1 0 2. Mean length of rod (cm) 7. The mean value of the means of all – the 10 samples. The values of x x plotted in a frequency distribution would give a sampling distribution of the means (though only 10 values are insufficient to show a correct distribution).56 g.88 g. 0 0 4. x10. the sampling distribution of the means of samples . 5. If the population distribution is symmetrical. 6. The – mean weight of this first sample.16 g. 00 7.02 g. 50 9.

12 101.35 99. .59 101.000 but closer than in the case of a single sample.23 8 101.61 99.61 99.27 101.34 101. The average of these is $75.23 98.3 99.5 101.77 98.5 100.84 100.54 101.000 and $90.1 100. Company is producing a lot (population) of 500.07 99.3 101.18 100.82 101.5 ● ● ● ● ● ● ● Sampling chocolate bars.85 taken at random from the population will be normally distributed regardless of the sample size withdrawn.000 [(60.04 100.47 98.98 98.44 101.24 100.79 99.93 99.02 = x 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 – x 100.76 100 101.000.77 99.58 98.28 98.07 101.35 7 101.78 98.39 98.7 100.94 100.38 99.46 100.6 98.66 98.84 98.71 4 101. The mean will be =.85 g – A distribution can be plotted with x on the x-axis.81 101.42 99. an inspector takes 10 random samples from production Each sample contains 15 slabs of chocolate Mean value of each sample is determined.3 98. inferences can be made about the population parameters without having information about the shape of the population distribution other than the information obtained from the sample.34 100.56 100.86 5 98.51 100.39 99.34 101.85 99.39 101.11 100.23 100.66 101.52 98.000.18 101.86 98.66 98.27 99.000.89 100.94 99.42 99.000 chocolate bars Nominal weight of each chocolate bar is 100 g To verify the weight of the population.02 98.23 100. This is a large enough number so that it can be Assume a random sample of just one salary value is selected that happens to be $90.96 100.4 99.15 6 98.11 9 98.55 101.6 98. This value is a long way from the mean value of $40.19 100.75 99.000. This is still far from $40.82 99.23 100.13 98.17 98.03 100.16 99.87 99.68 10 99.31 99.88 99.73 99.48 3 101. The practicality of these relationships with the central limit theory is that by sampling.71 101.53 98.8 99.94 98.89 98.89 101.8 99.42 98.94 98.17 101.87 99. Assume that the distribution of the employee salaries is considered normal with an average salary of $40. Sampling of individual salaries is made using random computer selection: ● ● Variability and sample size Consider a large organization such as a government unit that has over 100.79 101.65 98.52 98. considered infinite.26 98.81 99.07 98.84 98.27 98.25 98.95 99.48 98. x Sample number 1 2 100.24 98.8 98.16 100.84 100.27 99.62 100.15 99.06 100.56 101.74 99.26 98.98 100.23 99.03 99.2 100.48 100.09 98.16 101.38 101. either from non-normal populations or normal populations. This is – x – x Mean value of all the x is = or 99.64 98.13 98.03 101.06 98.08 101.68 99.61 99.09 98.55 99.17 101.2 101.22 99.84 98.99 99.000 employees.01 98.18 99.72 98.93 98.57 101.04 100.000 90.74 99.15 101.85 99.66 101.Chapter 6: Theory and methods of statistical sampling 195 Table 6.49 100.3 99.31 101.71 99.56 99.61 101. Assume now that random samples of two salaries are taken which happen to be $60.000)/2].

the mean value of these is $46. increasing the sample size reduces the spread or variability of the average value of the samples taken. and the sample size.000 or closer to the population average of $40. by the following relationship: σx σx n 6(ii) This indicates that as the size of the sample increases. if we take a series of samples from the employees and measure each time. and the corresponding profile of the sample distribution of the means with its standard error. is a better estimator of the population parameter than a distribution of sample means that is widely dispersed with a larger standard error. will almost certainly have different values each time simply because the chances are that our salary numbers in our sample will be different. The standard error indicates the magnitude of the chance error that has been made. the standard deviation of the sampling distribution. we x Sampling for the Means from an Infinite Population An infinite population is a collection of data that has such a large size that sampling from an infinite population involving removing or destroying some of the data elements does not significantly impact the population that remains. ● Thus. by taking larger samples there is a higher probability of making an estimate close to the population parameter. $45.10. Sample mean and the standard error – The mean of a sample is x and the mean of all possible samples withdrawn from the population is =. which shows the shape of a normal distribution with its standard deviation. variations in the volumes of liquid in cans of soft drinks. σx. As a comparison to the standard error. There are variations in the age of people.000. and $20. $90. The standard deviation of the sampling distribution is more usually referred to as the standard error of the sample means. populations show variation. That is. the standard deviation of the sampling distribution decreases. variations in the weights of a nominal chocolate bar.196 Statistics for Business If now random samples of five salaries $60. variations in the per capita income of individuals. By the central limit theory.000. Alternatively. For example.000. From the central limit theory. the – value of salaries. to the population standard deviation. This variability.000. etc. the arithmetic mean of the sample is said to be an unbiased estimator of the population mean. A distribution of sample means that has less variability.000.000 come up. is due to the chance or sampling error in our analysis between the samples we took and the population. $15. and the population causes variability in our analysis. or less spread out. as measured by the standard error of equation 6(ii). the difference between each sample. among the several samples. as evidenced by a small value of the standard error. or more simply the standard error as it represents the error in our sampling experiment. n. we have a standard deviation of a population. μx: = μ x 6(i) x And because of this relationship in equation 6(i). This is not an error but a deviation that is to be expected since by their very nature. the x mean of the entire sample means taken from the population can be considered equal to the population mean. going back to our illustration of the salaries of the government employees. σ x is related -. These comparisons are illustrated in Figure 6. and also the accuracy when using a sample statistic to estimate the population parameter. .

replaces the random variable. and the random variable. the mean value μx is replaced by the sample = mean x .10 Population distribution and the sampling distribution. μx. the standard equation then becomes. σx. and the standard error of the sampling distribution σx n replaces the standard deviation of the population. x . in a normal distribution is as follows: z x μx σx 5(ii) The standard equation for the sampling distribution of the means now becomes.Chapter 6: Theory and methods of statistical sampling 197 Figure 6. except that now the mean value of – the sample mean. σ x This relationship can be used using the four normal Excel functions already presented in Chapter 4. of the population distribution. the sample distribution or the sample error. we have shown that the standard relationship between the mean. is replaced by the standard deviation of -. the standard deviation. x σx x x μx σx x σx μx n 6(iv) z An analogous relationship holds for the sampling distribution as shown in the lower distribution of Figure 6. σx. the standard deviation of the normal distribution. . x. x. z x μx σx x σx x 6(iii) Substituting from equations 6(i) to 6(iii). Population distribution Mean x Standard deviation x x Sampling distribution Mean x x x x n x Modifying the normal transformation relationship In Chapter 5.10 where now: ● ● ● the random variable x is replaced by the sam– ple mean x. The following application illustrates the use of this relationship.

1 0.3 0.3 0. – Again from equation 6(iv) when x 7.81%.0671 From [function NORMSDIST] in Excel this gives a value from the left end of the curve of 63.1 bars on either side of the mean.0671 0. what proportion of sample means would have a release pressure between 6.8 and 7.2 0.1061 0.8284 0.8 and 7.4721 0. z x μx σx 6. the valves are automatically preset so that they open and release a flow of water when the upstream pressure in a heater exceeds 7 bars.1061 0.25 37.8 bars.0 0.97 79.9425 From [function NORMSDIST] in Excel this gives an area from the left end of the curve of 25. Thus.97%. z x σx μx n 6.8 7.25%.0 0.06%.1 bars. or a sample of size 1.6667 σx σx n 0.1 bars is 82. z x μx σx 7.3 0.1 bars? Here now we are sampling from the population with a sample size of 20.8 bars. the proportion of sample means that would have a release pressure between 6. 2. From equation 5(ii) when x 7. If many random samples of size eight were taken.2 0.8 and 7. z x σx μx n 6. what proportion of sample means would have a release pressure between 6. 1.8 and 7.1 bars is 63. Thus. If many random samples of size 20 were taken.3 20 0.74%.1 bars.0 0. gives the area under the curve from the left of 2.1 7.9814 From [function NORMSDIST] using the standard error in place of the standard deviation.8 7. .1 bars? Here we are only considering a single valve.1 7. z x σx μx n 7.8850 From [function NORMSDIST] in Excel using the standard error in place of the standard deviation. from the population between 6.3 8 0.71 2.30 bars. 3.8 and 7.0 0. the probability that a randomly selected valve has a release pressure between 6.0 0. σx σx n 0.8 7.8 bars.8 and 7. Using this value in equation 6(iv) when – x 6.1 0.3 4.0671 2. Using equation 6(ii) the standard error is. From equation 5(ii) when x 6.3 0.3 2. gives the area under the curve from the left of 82.06 25. In the production process. What proportion of randomly selected valves has a release pressure between 6.71%. In the manufacturing process there is a tolerance in the setting of the valves and the release pressure of the valves follows a normal distribution with a standard deviation of 0.1061 Using this value in equation 6(iv) when – x 6. Using equation 6(ii) the standard error is.1061 1.1 bars? Here now we are sampling from the normal population with a sample size of 8.1061 0.3333 From [function NORMSDIST] in Excel using the standard error in place of the standard deviation.2 0.198 Statistics for Business Application of sampling from an infinite normal population: Safety valves A manufacturer produces safety pressure valves that are used on domestic water heaters.

0 bars. Thus.2 0.0424 Proportion 37. Thus. limited.08%) not clustered around the mean. Alternatively. z x σx μx n 7.0424 79.8 7. z x σx μx n 6.1 bars (%) between 6.19% (100% 37. It implies that if one piece of the data from the population is .Chapter 6: Theory and methods of statistical sampling gives the area under the curve from the left of 0.8 bars.0424 2.3 50 0.81%) not clustered around the mean. – Again from equation 7(v) when x 7.3585 From [function NORMSDIST] using the standard error in place of the standard deviation. As in the calculations for the normal distribution.08%.1 bars? Here now we are sampling from the population with a sample size of 50.8 and 7.08 0.11.8 and 7. as the sample size increases there is a smaller dispersion of the values.06%.08% clustered around the mean and only 0. Using equation 6(ii) the standard error is.3000 0.1 7. if we wish we can avoid always calculating the value of z by using the [function NORMSDIST].0711 0.0424 0. What we observe is that not only the standard error decreases as the sample size increases but there is a larger proportion between the values of 6.92% (100% 99.8 and 7.0 0.08 1.1 0.6 Sample size Example.0424 0. what proportion of sample means would have a release pressure between 6.1 bars. Using this value in equation 6(iv) when – x 6. gives the area under the curve from the left of 99. and the relation of the central limit theory applies.20%.14 93.4903 From [function NORMSDIST] using the standard error in place of the standard deviation. safety valves.1 0.1 bars.8 and 7.0 0. – Again from equation 6(iv) when x 7.6 and the concept is illustrated in the distributions of Figure 6.0671 0. or small size.3 7.74 93.1061 0.1 7. z x σx μx n 7.81% of the data clustered around the values of 6.14%. the proportion of sample means that would have a release pressure between 6.00 99.1 bars which means that there is 62.0671 199 Table 6.714 From [function NORMSDIST] using the standard error in place of the standard deviation. 4.81 between 6.0671 0.8 and 7. To summarize this situation we have the results in Table 6.0 0. σx / n 0. the proportion of sample means that would have a release pressure Sampling for the Means from a Finite Population A finite population is a collection of data that has a stated.1 bars is 99. gives the area under the curve from the left of 93.1 bars is 93.20 99. 1 8 20 50 Standard error. σx σx n 0.0424 4.20 0.1 bars. gives the area under the curve from the left of 0%. That is a larger cluster around the mean or target value of 7. If many random samples of size 50 were taken. in applying these calculations the assumption is that the sampling distributions of the mean follow a normal distribution. For example.8 and 7. in the case of a sample size of 1 there is 37. Note. In the case of a sample size of 50 there is 99.08%.

8 7.71% Sample size 8 6.0 7. or removed. σx σx 6(ii) n .1 82. safety valves.11 Example.8 7.1 99. there would be a significant impact on the data that remains. 37. that is the size is relatively small and there is sampling with replacement (after each item is sampled it is put back into the population).0 7.08% Sample size 50 6. then we can use the equation for the standard error already presented. Modification of the standard error If the population is considered finite.1 93.1 destroyed.200 Statistics for Business Figure 6.0 7.8 7.8 7.81% Sample size 1 6.0 7.20% Sample size 20 6.

n N 19 290 0.55% of the population This ratio is greater than 5% and so we use the finite population multiplier in order to calculate the standard error. Application of sampling from a finite population: Work week A firm has 290 employees and records that they work an average of 35 hours/week with a standard deviation of 8 hours/week. This correction is applied when the ratio of n/N is greater than 5%. and n is the size of the sample.2500 is 40. N N n 1 6(vi) σx n N N n 1 6(v) then from equation 6(iii) for a value of 2. If a sample size of 19 employees is taken.87%. From [function NORMSDIST] in Excel. We know that the difference between the random variable and the population. In this case. What is the probability that an employee selected at random will be working between 2 hours/week of the population mean? In this case. (x μx) is equal to 2. if we are sampling without replacement. of size 290. The ratio n/N is. For a value of (x μx) 2 we have again from equation 6(ii).87 40.13%. assuming that the population follows a normal distribution.0655 or 6. symmetrical. and a normal distribution is. .13 19. 59.74% 2. of size 19 taken from a population. the area under the curve from the left to a value of z of 0. 1. Thus. the standard error of the mean is modified by the relationship. N. z x σx μx n σx n x μx N N n 1 6(vii) The application of the finite population multiplier is illustrated in the following application exercise. Or we could have simply concluded that z is 0.34% or less than 5% and so the population multiplier is not needed. by definition.2500 is the finite population multiplier. n 1 and N 290.Chapter 6: Theory and methods of statistical sampling However.2500 since the assumption is that the curve follows a normal distribution. Thus. again we have a single unit (an employee) taken from the population where the standard deviation σx is 8 hours/week.2500 201 From [function NORMSDIST] in Excel.2500 is 59. the probability that an employee selected at random will be working between 2 hours/week is. where N is the population size. the area under the curve from the left to a value of z of 0. again we have a sample. From equation 6(vi). n. z x μx σx 2 8 0. equation 6(iv) now becomes. (x μx) z x μx σx 2 8 0. σx Here the term. Thus. n/N 1/290 0. what is the probability that the sample means lies between 2 hours/week of the population mean? In this case.

7773 4.9377 271 289 μx 2 we 0.96% 19 2 35 . Thus. Note that 73.98%.9684 1.96%. From [function NORMSDIST] in Excel. 86.74%.96% is greater than 19.98 13. the probability that the sample means lie between 2 hours/week is.7773 1.3589 equation 6(vii) where now 2. because as we increase the sample size.74% 1 35 Sample size 73. the area under the curve from the left to a value of z of 1.9684 From – μ x x z 8 * 0. 2 Sample size 19.9684 1.02 40. work week. This concept is illustrated in Figure 6.202 Statistics for Business From equation 6(vii) for – x have.13 73.12. obtained for a sample of size 1.1253 From equation 6(v) the corrected standard error of the distribution of the mean is. σx σx n N N n 1 8 19 * 0. the area under the curve from the left to a value of z of 1.1253 is 86.1253 From [function NORMSDIST] in Excel. Figure 6.02%. the sampling distribution of the means is clustered around the population mean.12 Example.7773 N N n 1 (290 19) (290 1) 0. x σx n μx N N n 1 2 1.1253 is 13. For – μx x 2 we have. z σx x n μx N N n 1 2 1.

In Los Angeles country. x . We might extend this sample experiment further and say that 72. taken from the sample having the desired characteristic divided by the sample size. is the sampling distribution of the p proportion.450 say they are for gun control. n. 203 Sampling Distribution of the Proportion In sampling we may not be interested in an absolute value but in a proportion of the population. μ. In the binomial distribution the mean number of successes. Thus. We then take another sample p from the population and we have a new value – . we have established a binomial situation. with a characteristic probability of success. The proportion in the sample that says they are for gun control is thus 72.000? In these cases. n. n. the procedure is to sample from the population and then again use inferential statistics to draw conclusions about the population proportion. Thus.Chapter 6: Theory and methods of statistical sampling the United States population is for gun control.000 people from the State of California and 1. Measuring the sample proportion When we are interested in the proportion of the population. these would be very uncertain conclusions since the 2. assume we are interested in people’s opinion of gun control. In these types of situations we use sampling for proportions.000-sample size may be neither representative of California. and probably not of the United States. p. either the houses have a market value greater than $500. 6(x) μ p p . We sample 2.000). For example.000 or they do not. we have. for a sample size. the proportion in the sample that is against gun control is 27. If we repeat this process then we possibly p2 will have different values of –. The probability p distribution of all possible values of the sample proportion. or they do not.000 per year? What proportion of the houses in Los Angeles country in United States has a market value more than $500. Sampling distribution of the proportion In our sampling process for the proportion. This experiment is binomial because either a person is for gun control or is not.50% (1. or.50% of the population of California is for gun control or even go further and conclude that 72.50%). what proportion of the population will vote conservative in the next United Kingdom elections? What proportion of the population in Paris. p x n 6(viii) Binomial concept in sampling for the proportion If there are only two possibilities in an outcome then this is binomial. discussed in the previous section.000/year.50% (100% 72.50% of Dividing both sides of this equation by the sample size. μ np 6(ix) p n n The ratio μ/n is now the mean proportion of – successes written as μp .450/2. –. In Paris either an individual earns a salary more than €60. The sample proportion. However. This is analogous to the sampling – distribution of the means. is the ratio p of that quantity. assume that we take a random sample and measure the proportion having the desired characteristic and this is –1. France has a salary more than €60. is given by the relationship presented in Chapter 4: μ np 4(xv) For example. x. In the United Kingdom elections either a person votes conservative or he or she do not. –.

33 33 or greater than 5 From Chapter 5. – x . or 0. p. the population mean by the population proportion.33) 67 or again greater than 5 3 Economic and financial indicators. n(1 p) 100(1 0. the standard deviation of binomial distribution is given by the relationship. Here p is still 33%. z x σx x x μx σx 6(iv) Application of sampling for proportions: Part-time workers The incidence of part-time working varies widely across Organization for Cooperation and Development (OECD) countries . σp pq n p(1 p) n 6(xii) Alternatively. thus from equation 5(iv). we have. p σx the standard error of the sample means by -. what proportion between 25% and 35%. – p μx. the sample mean by the average sample proportion. if these criteria apply then by substituting in equation 6(iv) as follows. in the sample. we can use the normal distribution to approximate the binomial distribution when the following two conditions apply: np n(1 p) 5 5 5(iv) 5(v) That is. From equation 6(iv) we have the relationship. would be part-time workers? Now. where the value q 1 p And again dividing by n. 20 July 2002. the sample size is 100 and so we need to test again whether we can use the normal probability assumption by using equations 5(iv) and 5(v). the products np and n(1 p) are both greater or equal to 5. z x σx x x μx σx p σp p 6(xiii) From equation 5(v). If a sample of 100 people of the work force were taken in the Netherlands. and using the relationship developed in equation 6(iii). 88. σp and thus. σp the standard error of the proportion -. σp pq n p(1 p) n 6(xiv) p(1 p) n where the ratio σ/n is the standard error of the -. σ n pqn n pqn n2 pq n p(1 p) 6(xi) n z p p 6(xv) Since from equation 6(xii). The clear leader is the Netherlands where part-time employment accounts for 33% of all jobs. p p z p(1 p) n 6(xvi) The application of this relationship is illustrated as follows.3 1.33 and n is 100. Thus. np 100 * 0. . proportion.204 Statistics for Business Again from Chapter 4. σ npq np(1 p) 4(xvii) Then. The Economist. we can say that the difference between the sample proportion – and the popup lation proportion p is.

0469 0.4264 is 66. Here the mean value of the proportion for the population is 33% and the . p or 0.02 0. thus from equation 5(iv).63%.81 71.0332 0. and thus from equation 6(xiv) the standard error of the proportion is.0332 0.25 0. is 35%.0332 0.82% or 0.33. Thus. –.0469 0. the area under the curve from the left to a value of z of 0.67 200 The lower sample proportion. the proportion between 25% and 35%.35 0.0011 0.33 * 0.35 and thus from equation 6(xiii). is 25%. Thus.35 0. we can apply the normal probability assumption.25 0.0469 0.0469 0. in the sample. as the sample size increases the values will cluster around the mean value of the population.25 and thus from equation 6(xiii). The upper sample proportion.7182 2.6015 is 72.4264 From [function NORMSDIST] in Excel. we need to test whether we can use the normal probability assumption by using equations 5(iv) and 5(v).47%.33 0.02 0.33 0.33) 134 or again greater than 5 205 Thus.0332 2.33(1 0. If a sample of 200 people of the work force were taken in the Netherlands. the area under the curve from the left to a value of z of 1. is 25%. or 0. what proportion between 25% and 35%.44 62. 66. –.0800 0. we can apply the normal probability assumption. the proportion between 25% and 35% in a sample size of 200 that would be part-time workers is.81%. or 0. np 200 * 0.33) 100 0.33 and n is 200.67 100 From equation 5(v) n(1 p) 200(1 0.7058 From [function NORMSDIST] in Excel. Here p is 33%. σp 0.33 66 or greater than 5 Note that again this value is larger than in the first situation since the sample size was 100 rather than 200. z p σp p 0.0332 0. z p σp p 0.44%.4061 is 0.33 0.Chapter 6: Theory and methods of statistical sampling Thus. –. or 0.4061 From [function NORMSDIST] in Excel the area under the curve from the left to a value of z of 2.33(1 0.7058 is 4.33. 72.35 and thus from equation 6(xiii). σp 0.33 * 0.0469 1. p or 0. and thus from equation 6(xiv) the standard error of the proportion is. –. As for the mean.33 0.0022 0. The population proportion p is 33%. is 35%. or p 0. p or 0.6015 From [function NORMSDIST] in Excel. the area under the curve from the left to a value of z of 0.25 and thus from equation 6(xiii). in the sample. z p σp p 0.63 0. would be part-time workers? First. The population proportion p is 33%. The upper sample proportion.6203 The lower sample proportion.33) 200 0.03% or 0. z p σp p 0.0800 0. that would be part-time workers is.47 4.

This would be biased as the West End is pretty affluent and the voters sampled are more likely to vote Tory (Conservative).206 Statistics for Business Figure 6. This section gives considerations when undertaking sampling experiments. For example. To measure the average intelligence quotient (IQ) of all the 18year-old students in a country you take a sample of students from a private school. This would be biased because private school students often come from high-income families and their education level is higher.13. present in the sample data that gives lopsided. part-time workers.03% 100 33% Sample size 71. you wish to obtain the voting intentions of the people in the United Kingdom and you sample people who live in the West End of London. This concept is illustrated in Figure 6. Sample size 62. To measure the average income of residents of Los Angeles. misleading. Bias is favouritism. California you take Sampling Methods The purpose of sampling is to make reliable estimates about a population. or unrepresentative results. purposely or unknowingly. Bias in sampling When you sample to make estimates of a population you must avoid bias in the sampling experiment. It is usually impossible. and too expensive.13 Example. As the box opener “The sampling experiment was badly designed!” indicates. . to sample the whole population so that when a sampling experiment is developed it should as closely as possible parallel the population conditions. the sampling experiment to determine voter intentions was obviously badly designed.62% 200 33% sample proportions tested were 25% and 35% or both on the opposite side of the mean value of the proportion.

Each pig would have identification. 207 Table 6.8 Table of 12 random numbers between 1 and 200.7 Table of 63 random numbers between 1 and 630. either a tag. The same procedure would apply to the farmer and his pigs. 142 156 26 95 178 176 146 144 72 113 7 194 Excel and random sampling In Excel there are two functions for generating random numbers. and [function RANDBETWEEN] that generates a random number between the lowest and highest number that you specify. physical units such as products coming off a production . then the first 15 probably were more thoroughly cleaned than the rest – the maid was less tired! These sampling experiments are not random and probably they are not representative of the population. The farmer would generate a list of 12 random numbers between 1 and 200 as indicated in Table 6. the matrix in Table 6. 389 380 440 84 396 105 512 386 473 285 219 285 161 49 309 249 353 78 306 528 368 75 56 339 560 557 438 25 174 323 173 272 300 510 75 350 270 583 347 183 288 36 314 147 620 171 406 437 415 70 605 624 476 114 374 251 219 426 331 589 485 368 308 Table 6. Systematic sampling When a population is relatively homogeneous and you have a listing of the items of interest such as invoices. This would be biased as people who live in Santa Monica are wealthy. A business might want to sample 15% of its clients to obtain the level of customer satisfaction. The manager samples the first 15. [function RAND] that generates a random number between 0 and 1. You first create a random number in a cell and copy this to other cells. and weigh those 12 pigs that correspond to those numbers. For example. In order to perform random sampling. or embedded chip giving a numerical indication from 1 to 200. Each time you press the function key F9 the random number will change. For example. He samples the first 12 who come when he calls. Randomness in your sample experiment A random sample is one where each item in the population has an equal chance of being selected. Assume a farmer wishes to determine the average weight of his 200 pigs. you need a framework for your sampling experiment. as an auditor you might wish to analyse 10% of the financial accounts of the firm to see if they conform to acceptable accounting practices. Thus. Suppose that as an auditor you have 630 accounts in your population and you wish to examine 10% of these accounts or 63. you would examine those accounts corresponding to those numbers. a fleet of company cars. tattoo. You number the accounts from 1 to 630. You then generate 63 random numbers between 1 and 630 and you examine those accounts whose numbers correspond to the numbers generated by the random number function. A hotel might want to sample 12% of the condition of its hotel rooms to obtain a quality level of its operation.8. a hotel manager wishes to determine the quality of the maid service in his 90-room hotel. If the maid works in order. They are probably the fittest – thus thinner than the rest! Or.7 shows 63 random numbers within the range 1 to 630.Chapter 6: Theory and methods of statistical sampling a sample of people who live in Santa Monica.

Consider for example. You sample every 25th can of drink. In order to limit the cost and the time of the sampling experiment we decide to only survey 60 of the employees. 4 days/week.5% sample you analyse every 200 units 0.9 Department Employees Sample size Stratified sampling. teenagers in the age range 13 to 19 and their parents in the age range 40 to 50 differ very much in their tastes and ideas! Several strata of interest In a given population you may have several well-defined strata and perhaps you wish to take a representative sample from this population. 5 days a week to a proposed 10 hours/day. socio-economic levels.200). In this case.9. inventory going into storage. people in the 20–25 have a different preference of music and different needs of portable phones than say those in the 50–55-age range. Care must be taken in using systematic sampling that no bias occurs where the interval you choose corresponds to a pattern in the operation. the 1st row of Table 6. Each function is a stratum since it defines a specific activity. For example. married or single households.200 employees in the firm and so 60 represents 5% of the total workforce (60/1. Stratified sampling The technique of stratified sampling is useful when the population can be divided into relatively homogeneous groups. etc. etc. Stratified sampling is used when there is a small variation within each group. affiliated with the labour or conservative party. a stretch of road. you will be sampling a can that has been filled from the same nozzle. then systematic sample may be appropriate. If you want a 0. the strata may be students. male or female. There are a total of 1. but a wide variation among groups. For example. every 10th household receives a more detailed survey form to complete. For example. and random sampling is made only on the strata of interest. we would take a random sample of 5% of the employees from each of the departments or strata such that the sampling experiment parallels the population. you use systematic sampling to examine the filling operation of soft drink machine. if you want a 4% sample you analyse every 25th unit 4% of 100 is 25. people of a certain age range. or strata. is a form of systematic sample where although every household receives a survey datasheet to complete. The United States population census. Administration Operations Design R&D 160 8 300 15 200 10 80 4 Sales 260 13 Accounting 60 3 Information Systems 140 7 Total 1. It so happens that there are 25 filling nozzles on the machine. undertaken every 10 years. accurately reflects the characteristics of the target population. Suppose we wish to obtain the employees’ preference of changing from the current 8 hours/day. or a row of houses. For example. You first decide at what frequency you need to take a sample. Single people of a certain socioeconomic class are more likely to buy a sports car. Stratified sampling is used because it more Table 6. The number that we would survey is given in the 2nd row of Table 6.9 which gives the number of employees by function in a manufacturing company.200 60 .5% of 100 is 200.208 Statistics for Business line. Thus. If you want a 5% sample you analyse every 20th unit 5% of 100 is 20.

if properly designed. in her survey she would only interview females. The interviewer conducts her survey in a busy shopping area such as London’s Oxford Street. assume Birmingham is targeted for preference of a certain consumer product. or a situation. Cluster sampling. or clusters. “Which of the following best describes your professional activity?” as for example in Table 6. and each cluster is then sampled at random. sent by electronic mail. For example. or soliciting the information in areas frequented by potential consumers such as shopping malls or busy pedestrian areas. For example. you are male and the interviewer is targeting females. because required data is unavailable from other sources. it could be that you do not fit the strata desired by the interviewer. Surveys are often used to obtain ideas about a new product. or market surveys. and who are elegantly dressed. The sampling plan would use one. you appear to be over 50 and the interviewer is targeting the age group under 40. Table 6. you are white and the interviewer is targeting other ethnic groups. or quota sampling. Over 25–34 35–44 45–54 55–65 65 . For example.Chapter 6: Theory and methods of statistical sampling 209 Cluster sampling In cluster sample the population is divided into groups. or a combination of the methods above – simple random sampling. Thus. a concept. For example. rather than asking the question “How old are you?” give the respondent age categories as for example in Table 6. The city is divided into clusters using a city map and an appropriate number of clusters are selected for analysis. If you are in an area where surveys are being carried out. This stratification should give a reasonable probability that the selected candidates have some interest. then you might use a consumer survey.10 Under 25 Age range for a questionnaire. or requested in person. if you want to know the job of the respondent rather than asking. Avoid open-ended questions. stratified. etc. systematic. you should structure it so that this task is straightforward with responses that are easy to organize. or sample. “What is you job?” ask the question. perhaps less than 40. regarding the fashion magazine in question. The survey information is prepared on questionnaires. where responses are solicited from individuals who are targeted according to a well-defined sampling plan. Alternatively. cluster. Quota sampling In market research. completed by telephone. The collected survey data. Cluster sampling is used when there is considerable variation in each group or cluster.11. is then analysed and used to forecast or make estimates for the population from which the survey data was taken. but groups are essentially similar. the interviewer may be interested to obtain information regarding a ladies fashion magazine. In this type of sampling often the population is stratified according to some criteria so that the interviewer’s quota is based within these strata. In the latter case this may be either going door-to-door. which might be sent through the mail. can provide more accurate results than simple random sampling from the population. Here these categories are all encompassing. interviewers carrying out the experiment may use quota sampling where they have a specific target quantity to review. and thus an opinion.10. Consumer surveys If your sampling experiment involves opinions say concerning a product. When you develop a consumer survey remember that it is perhaps you who have to analyse it afterwards. Using quota sampling.

Often businesses use outside consulting firms specialized in developing consumer surveys. there may be a trade-off between using less costly. then the data is considered primary data. but perhaps less accurate. Person-to-person contact gives a much higher response for consumer surveys since if you are stopped in the street. The disadvantage is that the secondary data may not contain all the information you require. “everyone is too busy”. There is the operating side of collecting the data. or consumer patterns.11 Which of the following best describes your professional activity? Construction Consulting Design Education Energy Financial services Government Health care Hospitality Insurance Legal Logistics Manufacturing Media communications Research Retail Telecommunications Tourism Other (please describe) Primary and secondary data In sampling if we are responsible for carrying out the analysis. secondary data. The advantage with secondary data. primary data. If the sample experiment is well designed then this primary data can provide very useful information. Secondary data is information that has been developed by someone else but is used in your analytical work. Though if you have access to portable phone numbers this may not apply. economic trends. of designing the survey and the subsequent analysis. the questionnaire only reaches those who have E-mail. provided that it is in the public domain. and then those who care to respond. Consumer surveys can be expensive. Table 6. This is not all encompassing but there is a category “Other” for activities that may have been overlooked. Electronic mail surveys give a reasonable . Telephone surveys give a higher return because voice contact has been obtained. Thus.210 Statistics for Business response. The other segment of the population. and the associated cost. Postal responses have a very low response and their use has declined. Secondary data might be demographic information. Those people who do respond may not be representative in the sample. There is the cost of designing the questionnaire such that it is able to solicit the correct response. and then the subsequent analysis. is not available because they are working. and/or it may be not be up to date. Soliciting information from consumers is not easy. as it is very quick to send the survey back. The disadvantage with primary data is the time. or non-employed individuals who are more likely to be at home when the telephone call is made. again the sample obtained may not be representative as those contacted may be the unemployed. However. the format may not be ideal. In some instances it may be possible to use secondary data in analytical work. usually larger. is that it costs less or at best is free. or at least responsible for designing the consumer surveys. and more expensive but more reliable. However. which is often available through the Internet. retirees or elderly people. a relatively large proportion of people will accept to be questioned.

n. Using the binomial relationships for the mean and the standard deviation. basic relationships. the more reliable is our estimate of the population parameter. z. since either values in the sample have the desired characteristics or they do not. This theory states that as the size of the sample increases.Chapter 6: Theory and methods of statistical sampling 211 This chapter has looked at sampling covering specifically. With this standard error of the proportion. and the value of the sample proportion. the more the data clusters around the population mean and there is less variability in the data. Chapter Summary Statistical relations in sampling for the mean Inferential statistics is the estimate of population characteristics based on the analysis of a sample. . The larger the sample size. we can develop the standard error of the proportion. sampling for proportions. there becomes a point when the distribution of the sample means can be approximated by the normal distribution. the larger the sample size. the closer is our estimate to the population proportion. the mean of all sample means withdrawn from the population is equal to the population mean. Again. Further. as the sample mean less the population mean divided by the standard error. When we have done this. the standard error of the estimate in a sampling distribution is equal to the population standard deviation divided by the square root of the sample size. which is the square root of the ratio of the population size minus the sample size to the population size minus one. When we have a finite population we modify the standard error by multiplying it by a finite population multiplier. the larger the sample size. Again as before. Sampling for the means from a finite population A finite population in sampling is defined such that the ratio of the sample size to the population size is greater than 5%. When we use this relationship we find that the larger the sample size. we can use this modified relationship to infer the characteristics of the population parameter. It is the central limit theory that governs the reliability of sampling. we can make an estimate of the population proportion in a similar manner to making an estimate of the population mean. sampling for the mean in infinite and finite populations. The binomial relationship governs proportions. and sampling methods. In this case. the more the sample data clusters around the population mean implying that there is less variability. Sampling for the means for an infinite population An infinite population is a collection of data of a large size such that by removing or destroying some data elements the population that remains is not significantly affected. Sampling distribution of the proportion A sample proportion is the ratio of those values that have the desired characteristics divided by the sample size. This means that the sample size is large relative to the population size. Here we can modify the transformation relationship that apply to the normal distribution and determine the number of standard deviations.

and secondary data that maybe in the public domain. .212 Statistics for Business Sampling methods The key to correct sampling is to avoid bias. Cluster sampling is another way of making a sampling experiment when the population is divided up into manageable clusters that represent the population. which is taking samples at predetermined intervals according to the desired sample size. or number of units to analyse that may be according to a defined strata. If we have a relatively homogeneous population we can use systematic sampling. Primary data is normally the most useful but is usually more costly to develop. by E-mail. Quota sampling is when an interviewer has a certain quota. that is not taking a sample that gives lopsided results. Stratified sampling can be used when we are interested in a well-defined strata or group. Consumer surveys are part of sampling where respondents complete questionnaires that are sent through the post. and then sampling an appropriate quantity within a cluster. When you construct a questionnaire for a consumer surveys. Microsoft Excel has a function that generates random numbers between given limits. or face to face contact. avoid having open-ended questions as these are more difficult to analyse. and to ensure that the sample is random. completed over the phone. or that collected by the researcher. Stratified sampling can be extended when there are several strata of interest within a population. In sampling there is primary data.

What percentage of the bags produced has a breaking strength between 8.5 and 7. What is the probability that in 10 accounts chosen at random. These bags have a nominal breaking strength of 8 kg/cm2 with a production standard deviation of 0.Chapter 6: Theory and methods of statistical sampling 213 EXERCISE PROBLEMS 1. Telephone calls Situation It is known that long distance telephone calls are normally distributed with the mean time of 8 minutes. 6.5 kg/cm2? 2. Explain the differences.0 and 8. the average monthly balance will lie between £180 and £250? 2.0 and 8. Compare the answers of Questions 1 and 3. What proportion of the sample means of size 10 will have breaking strength between 8. Required 1.70 kg/cm2. the sample average monthly balance will lie between £180 and £250? 4.5 and 7. and 2 and 4. What percentage of the bags produced has a breaking strength between 6. Food bags Situation A paper company in Finland manufactures treated double-strength bags used for holding up to 20 kg of dry dog or cat food. What assumptions are made in determining these estimations? 2. the sample average monthly balance will lie between £180 and £250? 3. What is the probability that in 25 accounts chosen at random. The manufacturing process of these food bags follows a normal distribution.5 kg/cm2? 4.5 kg/cm2? 3. . a large bank knows that the average monthly credit card account balance is £225 with a standard deviation of £98. 5. Required 1. What is the probability that in an account chosen at random.5 kg/cm2? 5. Credit card Situation From past data. and the standard deviation of 2 minutes. What proportion of the sample means of size 10 will have breaking strength between 6. What distribution would the sample means follow for samples of size 10? 3.

What is the probability that a call taken at random will last between 7. There is a maintenance rule such that if the sample average content of 10 cups falls below 32. With a nominal machine setting of 33 cl. If random samples of 25 calls are selected. is made and sold continuously throughout the day. Explain the difference in the results. Baking bread Situation A hypermarket has its own bakery where it prepares and sells bread from 08:00 to 20:00 hours. how often would this happen at a nominal machine setting of 33 cl? 5.8 and 8.5 and 8. What is the volume that is dispensed such that only 5% of cups contain this amount or less? 2. what is the probability that a call taken at random will last between 7. what is the volume that will be exceeded by 95% of sample means? 4. what is the probability that a call taken at random will last between 7. If random samples of 100 calls are selected.8 and 8.50 cl. called “pave supreme”. Required 1.0 minutes? 5.0 minutes? 7. If the machine is regulated such that only 5% of the cups contained 30 cl or less. This bread. by how much could the nominal value of the machine setting be reduced? In this case.5 and 8. 4. a technician will be called out to check the machine settings. What should be the nominal machine setting to ensure that no more than 2% maintenance calls are made? In this case.0 minutes? 3.8 and 8. In this case.5 and 8.2 minutes? 6.2 minutes? 4. What is the probability that a call taken at random will last between 7.2 minutes? 2. left for 3 hours to rise before being baked in the oven. which is a nominal 500 g loaf. The filling operation is normally distributed and the standard deviation is 1 cl no matter the setting of the mean value. is individually kneaded. what is the probability that a call taken at random will last between 7. If random samples of 100 calls are selected.214 Statistics for Business Required 1. During the kneading and . One extremely popular bread. If random samples of 25 calls are selected. Soft drink machine Situation A soft drinks machine is regulated so that the amount dispensed into the drinking cups is on average 33 cl. what is the probability that a call taken at random will last between 7. on average customers will be receiving how much more beverage? 5. if samples of 10 cups are taken. on average a customer would be receiving what percentage less of beverage? 3.

and a standard deviation of 11 minutes. what is the probability that the average weight of the four weigh more than 520 g? 3. Financial advisor Situation The amount of time a financial advisor spends with each client has a population mean of 35 minutes. If you go to the store and take at random eight pave supremes. If random sample of 25 clients are selected. If you go to the store and take at random four pave supremes. there is a 35% chance that the sample mean will be below how many minutes? 5. If a random sample of 16 clients is selected. 8.Chapter 6: Theory and methods of statistical sampling 215 baking process moisture is lost but from past experience it is known that the standard deviation of the finished bread is 17 g. Say that you are planning a larger dinner party and you go to the store and take at random eight pave supremes. Explain the differences between Questions 4 to 6. what is the probability that the average time spent per client will be at least 37 minutes? 4. what is the probability that the time spent with the client will be at least 37 minutes? 2. and 5. there is a 35% chance that the time the financial advisor spends with the client will be below how many minutes? 3. What assumptions do you make in responding to these questions? . If random sample of 16 clients are selected. If a random client is selected. If you go to the store and take at random one pave supreme. If you go to the store and take at random one pave supreme. what is the probability that the average weight of the eight breads weigh more than 520 g? 4. what is the probability that the average weight of the loaves will be between 480 and 520 g? 7. If a random client is selected. what is the probability that it will weigh between 480 and 520 g? 5. You are planning a dinner party and so you go to the store and take at random four pave supremes. Explain the differences between Questions 1 to 3. Required 1. 8. what is the probability that it will weigh more than 520 g? 2. what is the probability that the average time spent per client will be at least 37 minutes? 6. what is the probability that the average weight of the loaves will be between 480 and 520 g? 6. 1. 3. Why is the progression the reverse of what you see for Questions 1 to 3? 6. there is a 35% chance that the sample mean will be below how many minutes? 7. If a random sample of 25 clients is selected. Explain the differences between Questions 1.

computers from the electrical systems. What are the upper and lower limits between which 90% of the sample averages will lie for samples of size four? 5. Between them . what is the probability that he will be over 2 m? 2. what percentage of such samples will have average heights over 2 m? 6. Often from these wrecks they recoup engine parts. the height of adult males is normally distributed. Joe and his three colleagues pay themselves €15 each per hour and they work 40 hours/week.400 of the mean? 9. Height of adult males Situation In a certain country. If samples of four men are taken.400 of the mean? 2. Required 1. after buying ASDA in Great Britain. Wal-Mart Situation Wal-Mart of the United States.500. Required 1. salvaged components on an average generate €198 per car with a standard deviation of €55.216 Statistics for Business 7. is now looking to move into France. and batteries. what is the probability that the sample mean of profits for these 50 stores will lie within €5. Explain the differences in the results. 8. Their work consists of visiting sites that have automobile wrecks and recovering those parts that can be resold. If samples of nine men are taken. If Wal-Mart management selects 50 stores at random. What are the upper and lower limits between which 90% of the sample averages will lie for samples of size nine? 7. scrap metal. It has targeted 220 supermarket stores in that country and the present owner of these said that profits from these supermarkets follows a normal distribution. have the same mean. Financial information is on a monthly basis. Automobile salvage Situation Joe and three colleagues have created a small automobile salvage company. From past work. What are the upper and lower limits of height between which 90% will lie for the population of adult males? 3. with a mean of 176 cm and a variance of 225 cm2. with a standard deviation of €37. If one adult male is selected at random. what percentage of such samples will have average heights over 2 m? 4. If Wal-Mart selects a store at random what is the probability that the profit from this store will lie within €5.

If random samples of 400 people in the age range 25 to 64 are selected. what proportion of the samples between 69% and 75% will be white? 3. Business Week. 21 November 2005. what proportion of the samples between 24% and 30% will have at least a bachelors degree? 6. On the assumption that the probability outcome in Question No. The Wall Street Journal. One particular period they carry out salvage work at a site near Hamburg.200? 3. If random samples of 200 people in the age range 25 to 64 are selected. Explain the difference between each paired question of 1 and 2. the population of the United States in the age range 25 to 64 years. If random samples of 400 people in the age range 25 to 64 are selected. 90. took place in Hong Kong between 13 and 18 December 2005. US. 1. 16% of the total population in the same age range were high school dropouts and 27% had at least a bachelor’s degree. According to data. 3 and 4. If random samples of 400 people in the age range 25 to 64 are selected. part of the Doha Round. 2 is achieved. Further in this same year. . what proportion of the samples between 13% and 19% will be high school dropouts? 4. What is the probability that after one weeks work the team will have collected enough parts to generate total revenue of €4. What is the correct standard error for this situation? 2. what proportion of the samples between 13% and 19% will be high school dropouts? 5. EU walk fine line at heart of trade impasse. Germany where there are 72 wrecked cars. and 5 and 6.4 Required 1. p. the average percentage tariff imposed on all imported tangible goods and services in certain selected countries is as follows5: 4 5 Losing ground. If random samples of 200 people in the age range 25 to 64 are selected. 13 December 2005.Chapter 6: Theory and methods of statistical sampling 217 they are able to complete the salvage work on four cars per day. what would be the net income to each team member at the end of 1 week? 10. World Trade Organization Situation The World Trade Organization talks. Education and demographics Situation According to a survey in 2000. 72% were white. Required 1. what proportion of the samples between 69% and 75% will be white? 2. what proportion of the samples between 24% and 30% will have at least a bachelors degree? 7. p. 11. If random samples of 200 people in the age range 25 to 64 are selected.

Malta. If a random sample of 400 imported tangible goods or service into Burkina Faso were selected.5% Required 1. what is the probability that the average proportion of the tariffs for this sample would be between 1% and 4%? 2. Romania.2% Burkina Faso 12. 12. what is the probability that the average proportion of the tariffs for this sample would be between 25% and 32%? 7. 3 and 4. Explain the difference between each paired question of 1 and 2. what proportion between 12% and 22% would be illiterate? 6 Too soon for Turkish delight. and 5 and 6.1% European Union 4. what is the probability that the average proportion of the tariffs for this sample would be between 10% and 14%? 6. what is the probability that the average proportion of the tariffs for this sample would be between 10% and 14%? 3.0% Brazil 12. and Slovakia Europe in 2003. what is the probability that the average proportion of the tariffs for this sample would be between 25% and 32%? 4. If random samples of 500 females over 15 were taken in Turkey in 2003. The Economist. If random samples of 250 females over 15 were taken in Turkey in 2003. and Croatia and three member countries – Greece. If a random sample of 200 imported tangible goods or service into Burkina Faso were selected. p.7% India 29. what proportion between 12% and 22% would be illiterate? 2. If a random sample of 200 imported tangible goods or service into the United States were selected. 1 October 2005.218 Statistics for Business United States 3. 25. If a random sample of 400 imported tangible goods or service into the United States were selected. what is the probability that the average proportion of the tariffs for this sample would be between 1% and 4%? 5. the female illiteracy of those over 15 was reported as follows6: Turkey 19% Greece 12% Malta 11% Romania 4% Croatia 3% Slovakia 0. If a random sample of 400 imported tangible goods or service into India were selected. Female illiteracy Situation In a survey conducted in three candidate countries for the European Union – Turkey.4% Required 1. . If a random sample of 200 imported tangible goods or service into India were selected.

3 and 4. If random samples of 200 people under 25 were taken in France in 2005. what proportion between 0. If random samples of 100 people under 25 were taken in France in 2005. what proportion between 9% and 13% would be illiterate? 4. 21 November 2005. what proportion between 12% and 15% would be unemployed? 2. and 11. the unemployment rate among people under 25 in France was 21. and 5 and 6? 8. what proportion between 9% and 13% would be illiterate? 5. . p.1% and 1. What is your explanation for the difference between each paired question of 1 and 2. what proportion between 12% and 15% would be unemployed? 6. what proportion between 12% and 15% would be unemployed? 7.4% in the United States. If random samples of 250 females over 15 were taken in Malta in 2003.1% and 1. If random samples of 100 people under 25 were taken in Britain in 2005. what proportion between 12% and 15% would be unemployed? 7 France’s young and jobless. These numbers in part are considered to be reasons for the riots that occurred in France in 2005. If random samples of 100 people under 25 were taken in Germany in 2005. If random samples of 200 people under 25 were taken in Germany in 2005.25%. If random samples of 500 females over 15 were taken in Malta in 2003. would you be surprised? 13. what proportion between 12% and 15% would be unemployed? 4.7 Required 1. what proportion between 0. what proportion between 12% and 15% would be unemployed? 8. Unemployment Situation According to published statistics for 2005.0% would be illiterate? 7. what proportion between 12% and 15% would be unemployed? 5. 23. what proportion between 12% and 15% would be unemployed? 3.8% for Germany.6% in Britain. If random samples of 500 females over 15 were taken in Slovakia in 2003. If you took a sample of 200 females over 15 from Istanbul and the proportion of those females illiterate was 0. If random samples of 200 people under 25 were taken in Britain in 2005. Business Week. 12.7% compared to 13. If random samples of 200 people under 25 were taken in the United States in 2005.Chapter 6: Theory and methods of statistical sampling 219 3. If random samples of 250 females over 15 were taken in Slovakia in 2003. If random samples of 100 people under 25 were taken in the United States in 2005.0% would be illiterate? 6.

8. what proportion between 13% and 15% would be in manufacturing? 5.220 Statistics for Business 9. 69. what proportion between 20% and 26% would be in manufacturing? 2. If random samples of 400 people of the working population were taken from Britain in 2005. What is your explanation for the difference between each paired question of 1 and 2. what proportion between 13% and 15% would be in manufacturing? 4. If random samples of 200 people of the working population were taken from the United States in 2005. and 5 and 6. If random samples of 200 people of the working population were taken from Germany in 2005. The Economist. p. what proportion between 6% and 10% would be in manufacturing? 7. 3 and 4. Manufacturing employment Situation According to a recent survey by the OECD in 2005 employment in manufacturing as a percent of total employment. If random samples of 400 people of the working population were taken from the United States in 2005. . If random samples of 400 people of the working population were taken from Germany in 2005. Why do the data for France in Questions 1 and 2 not follow the same trend as for the questions for the other three countries? 14. If random samples of 200 people of the working population were taken from Britain in 2005. What is you explanation for the difference between each paired question of 3 and 4. If a sample of 100 people was taken in Germany in 2005 and the proportion of the people in manufacturing was 32%. what conclusions might you draw? 8 Industrial metamorphosis. has fallen dramatically since 1970. what proportion between 6% and 10% would be in manufacturing? 6. 5 and 6. 1 October 2005. The following table gives the information for OECD countries8: Country 1970 2005 Germany 40% 23% Italy 28% 22% Japan 27% 18% France 28% 16% Britain 35% 14% Canada 23% 14% United States 25% 10% Required 1. and 7 and 8? 10. what proportion between 20% and 26% would be in manufacturing? 3.

14 December 2005. The Economist.03% and 0. Jamaica is one the world’s worst country for homicide. You wish to get information about the whole population included in this database including criteria such as job satisfaction. If random samples of 2. 42. How it compares with some other countries according to the number of homicides per 100. years with the organization.Chapter 6: Theory and methods of statistical sampling 221 15. gender. 16. Africa 25 44 Columbia Jamaica 47 59 Required 1. If you lived in Jamaica what is the probability that some day you would be a homicide statistic? 2. p. age at last birthday. . the country where the staff member is based. For budget reasons you are limited to interviewing a total of 40 people and some of these will be done by telephone but others will be personal interviews in the country of operation. Explain the difference between Questions 3 and 4.000 people were selected in Jamaica what is the proportion between 0.000 people were selected in Jamaica what is the proportion between 0. more fear.000 people is given in the table below10: Britain 2 United States 6 Zimbabwe Argentina Russia 8 14 21 Brazil S. Less crime. based in Paris has 248 personnel in its database according to the table below. doctors without borders. If you lived in Britain what is the probability that some day you would be a homicide statistic? Compare this probability with the previous question? What is another way of expressing this probability between the two countries? 3. safety concerns in the country of work. Humanitarian agency Situation A subdivision of the humanitarian organization.09% that would be homicide victims? 4.03% and 0.9 According to the statistics of 2005. 1 October 2005. Homicide Situation In December 2005. and other qualitative factors. 9 10 A murder in Jamaica.09% that would be homicide victims? 5. International Herald Tribune. p. If random samples of 1. This database gives in alphabetical order the name of the staff members. human relationships in the country. Steve Harvey. and their training in the medical field. an internationally known AIDS outreach worker was abducted at gunpoint from his home in Jamaica and murdered. 8.

Youssef Berard. Eloise Berton. Funmilayo Alexandre. Gaëlle Alibizzata. Eric Angue Assoumou. Agnès Baud. Aminata Banguebe. Ariane Boulay. Name Gender Age Years with agency 2 17 16 5 12 19 2 2 18 12 20 12 18 12 1 3 1 14 2 10 18 2 17 3 22 2 8 8 1 1 11 10 9 7 26 28 Country where based Chile Brazil Chile Kenya Brazil Kenya Chile Brazil Chile Cambodia Costa Rica Thailand Brazil Kenya Chile Vietnam Costa Rica Kenya Kenya Chile Costa Rica Brazil Vietnam Vietnam Ivory Coast Brazil Brazil Kenya Vietnam Kenya Chile Brazil Kenya Kenya Kenya Kenya Medical training 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 Abissa. Audrey Awitor. Olivia Aulombard. Nicolas Batina. Sandrine Baque. Grégory Bouziat. Kimberley Blanchon. Ericka Akintayo. Abena Ahihounkpe. Which do you believe is the best experiment? Explain your reasoning: No. Yasmina Murielle Adekalom. Maxime Belkora. Lucas Briatte. Laetitia Beyschlag. Of the plans that you select draw conclusions. Thomas Bomboh. Bertrand Bossekota. Consider total random sampling. Sabrina Aubert. and strata sampling. Paul Blondet. cluster sampling. Maily Adjei. Mélodie Arfort. Natalie Black. Cédric Batty-Ample.222 Statistics for Business Required Develop a sampling plan to select the 40 people. Emmanuelle Bernard. Nicolas Aubery. Oumy Bakouan. Pierre-Edouard F F F F F F F M F F M F F M F F F M M F F M F F M F F F M M M M F M M M 26 45 41 29 46 46 31 30 47 47 50 34 49 36 27 24 41 42 32 31 44 46 41 40 46 28 34 32 23 34 31 32 37 36 53 48 Nurse General medicine Nurse Physiotherapy Nurse Nurse Radiographer Nurse Nurse Nurse Nurse Nurse Physiotherapy Nurse Nurse Nurse Nurse Nurse Nurse Nurse Nurse Radiographer Nurse Surgeon Nurse Nurse Nurse Nurse Nurse Nurse Nurse Physiotherapy Nurse Nurse Nurse Nurse . Patrick Bordenave. Euloge Ba. In all cases use the random number function in Excel to make the sample selection. Alexandra Besenwald. Myléne Ama.

Nathalie Doe-Bruce. Camille Declippeleir. Stanislas Coffy. Dagmar Chabanel. Mélanie Douenne. Natascha Buzingo. Romain Chartier.Chapter 6: Theory and methods of statistical sampling 223 No. Jonathan Ducret. Patrick Cablova. Ralou Maimouna Diehl. Johannson Czajkowski. Othalia Ayele E Donnat. Louise Cordier. Yan Crombe. Alexandre Collomb. François-Xavier Du Mesnil Du Buisson. Edouard Dubourg. Fanny Coradetti. Benjamin Delegue. Zineb Chahed. Aude Deplano. Camille F F M F F F F M F M M M M F F M M M M M M F M M M F M M F F F M F F F M M F F F M M M F 27 55 46 53 31 27 53 46 40 28 45 48 36 54 36 43 27 42 42 51 34 50 54 38 55 50 31 47 31 23 30 54 34 31 50 45 25 33 53 51 37 52 44 50 General medicine Nurse Nurse Nurse Nurse Nurse Physiotherapy Nurse Nurse Nurse Radiographer Nurse Nurse Nurse Nurse Surgeon Nurse Nurse Nurse Nurse Nurse Nurse Physiotherapy Nurse Nurse Nurse Nurse Nurse Radiographer Nurse Nurse Nurse Nurse Nurse General medicine Nurse Nurse Physiotherapy Nurse Nurse Surgeon Nurse Nurse Nurse (Continued) . Gael Chabanier. Henri Chaudagne. Joel De Messe Zinsou. Héloise Delobel. Delphine Demange. Jean-Michel Croute. Thierry De Zelicourt. Samy Chappon. Guillaume Desplanches. Ainaou Dansou. Laurence Bruntsch-Lesba. Isabelle Destombes. Mathieu Dadzie. Mohamed Dobeli. Pierre Diop. Hélène Diallo. Robin Coissard. Olivier Delahaye. Benjamin Cusset. Maud Chahboun. Kelly Dandjouma. Gonzague Debaille. Name Gender Age Years with agency 3 30 5 14 2 1 23 12 18 8 10 24 8 18 11 1 7 17 12 10 10 2 27 2 4 26 9 11 6 3 1 33 10 7 25 11 5 10 19 16 15 21 3 16 Country where based Ivory Coast Cambodia Thailand Kenya Ivory Coast Brazil Brazil Thailand Costa Rica Vietnam Chile Ivory Coast Chile Ivory Coast Cambodia Brazil Brazil Vietnam Cambodia Brazil Chile Nigeria Brazil Ivory Coast Cambodia Kenya Chile Ivory Coast Brazil Kenya Vietnam Thailand Thailand Brazil Ivory Coast Brazil Chile Chile Cambodia Thailand Ivory Coast Vietnam Cambodia Thailand Medical training 37 38 39 40 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 Brunel.

Vivienne Gava. Nicolas Gaillardet. Guillaume Honnegger. Name Gender Age Years with agency 25 5 15 11 2 6 8 5 18 3 12 4 5 9 10 3 3 6 11 13 4 5 9 11 33 1 13 3 2 8 5 3 9 2 7 14 16 35 7 2 3 13 4 1 24 Country where based Chile Costa Rica Kenya Cambodia Brazil Cambodia Chile Thailand Ivory Coast Ivory Coast Brazil Kenya Vietnam Brazil Ivory Coast Brazil Kenya Costa Rica Brazil Brazil Ivory Coast Thailand Kenya Nigeria Costa Rica Costa Rica Chile Brazil Thailand Cambodia Brazil Chile Cambodia Vietnam Thailand Costa Rica Vietnam Chile Kenya Brazil Vietnam Brazil Brazil Brazil Brazil Medical training 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 Dufau. Charlotte Gassier. Philippe Kervaon. Gilles Hazard. Stéphanie Flandrois. Stéphanie Felio. Darko Kassab. Yannick Errai.224 Statistics for Business No. Joeata Kasalica. Sophie Garnier. Camille Guillot. Sébastien Jomard. Sébastien Dutraive. Charles Garraud. Mathilde Gerard. Nicholas Hardy. Guillaume Dufaud. Hélène Jiguet-Jiglairaz. Dorothée Houdin. Caroline Escarboutel. Julia Huang. Loïc Kacou. Christel Etien. Valentin Ginet-Kauders. Antoine Gueit. Nathalie M M F M M F M M F F F M M F M F F M F F F M F M M F M M F M M M M F F F F M M F F F M M F 45 28 36 50 33 26 28 47 42 52 54 32 29 32 31 23 31 27 43 33 26 50 29 40 54 32 33 31 46 45 33 30 38 45 49 35 47 55 34 35 48 51 24 29 45 Nurse Radiographer Nurse Nurse Nurse Physiotherapy Nurse Nurse General medicine Nurse Nurse Nurse Surgeon Nurse Nurse Nurse Nurse Nurse Radiographer Nurse Nurse Physiotherapy Nurse Nurse Nurse Nurse Nurse Nurse Nurse Nurse Nurse Nurse Nurse Nurse Physiotherapy Nurse Nurse Nurse Surgeon Nurse Nurse Nurse Nurse Radiographer Nurse . Sébastien Fernandes. Charly Dujardin. Claudio Fillioux. Aneta Kasalica. Vincent Germany. Benjamin Eberhardt. Delphine Guerite. David Gobber. Shan-Shan Jacquel. Skander Erulin. Nadine Ebibie N’ze. Baptiste Gremmel. Aurélie Grangeon. Agathe Dutel. Marion Garapon. Sam Julien. Julie Gesrel.

Stéphanie Maskey. Brenda Ozkan. William Perrenot. Dorothée Miribel. Claude Ndiaye. Stéphanie Martinez. Mélinda Mermet. Julien Lu Shan Shan Marchal. Fréderic Okewole. Alexandra Mermet. Madeleine Kolow. Guillaume Oculy. Arthur Marganne. Si Si Liubinskas. Maxine Omba. Johan F M F M M M F F M M F M M F F F F F M F F F F M F F M F F M M M M M M M F F F F F M M M 44 40 50 42 38 37 29 32 25 34 31 45 25 33 42 46 25 23 29 48 25 27 54 53 40 53 32 55 35 50 37 28 29 45 51 47 28 25 43 55 38 43 30 47 Nurse Nurse Nurse Nurse Nurse Physiotherapy Nurse Nurse Nurse Nurse Nurse Nurse Nurse Nurse Physiotherapy Nurse Nurse Nurse Nurse Nurse Nurse Radiographer Nurse Nurse Nurse Nurse Nurse Nurse Surgeon Nurse Physiotherapy Nurse Nurse Nurse Nurse Nurse Nurse Surgeon Nurse Nurse Nurse Nurse Nurse Radiographer (Continued) . Cloé Perera. Laura Murgue. Selda Paillet. Julien Lestangt. Ricardas Loyer. Cédric Mathisen. Julien Montfort. Emmanuel Nddalla-Ella. Florence Michel. Lati Martin. Jean-Philippe Neves. Alexandre Latini. Maïté Penillard. Baye Mor Neulat. Nguie Ostler. Baptiste Lehot. Richard Marone. Name Gender Age Years with agency 11 7 23 14 18 16 8 5 4 10 8 12 4 4 5 16 5 1 8 12 1 1 24 6 5 16 2 24 14 23 17 2 8 12 21 24 1 1 21 28 5 17 3 3 Country where based Costa Rica Chile Chile Brazil Ivory Coast Vietnam Cambodia Chile Cambodia Vietnam Chile Nigeria Chile Brazil Kenya Brazil Thailand Vietnam Brazil Cambodia Brazil Brazil Brazil Vietnam Chile Nigeria Kenya Thailand Thailand Vietnam Cambodia Brazil Brazil Chile Kenya Ivory Coast Brazil Kenya Nigeria Kenya Ivory Coast Nigeria Kenya Costa Rica Medical training 127 128 129 130 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 Kimbakala-Koumba. Stéphane Lauvaure. Cyrielle Martin. François Nauwelaers. Julien Legris. Julien Monnot.Chapter 6: Theory and methods of statistical sampling 225 No. Aurélie Li. Christophe Pesenti. Christophe Nicot. Emilie Owiti. Lilly Masson.

Alexis Roy. Amir Schwartz. Grégory Ramanoelisoa. Andrew Ravets. Sébastien Rouviere. Steven Ruget. Steven Souah. Marie-Charlotte Rudkin. Ludovic Portmann. Alexandre Richard. Pierre-Jean Sassioui. Olivier Seimbille. Eliane Goretti Rambaud. Jennifer Prou. Emmanuelle Ribieras.226 Statistics for Business No. Steve F F M M F M M F M M F M M M M M M F M M F M F M M F F F M M M F F F F M M M M F F M M M 48 39 24 45 41 42 55 49 27 43 30 45 23 41 37 51 23 51 41 24 38 35 45 35 55 45 31 51 32 48 47 54 33 54 53 39 46 47 47 51 36 34 26 50 Nurse Nurse Nurse Nurse Physiotherapy Nurse Nurse Nurse Nurse Nurse Nurse Nurse Nurse Nurse Nurse Nurse Nurse Nurse General medicine Nurse Nurse Nurse Nurse Nurse Physiotherapy Nurse Nurse Radiographer Nurse Nurse Nurse Nurse Surgeon Nurse Nurse Nurse Nurse Nurse Nurse Nurse Nurse Nurse Nurse Nurse . Joffrey Rutledge. Céline Philetas. Florent Salami. Martin Sok. Stéphanie Schmuck. Dominique Pfeiffer. Diana Ruzibiza. Vincent Raffaele. Aurélie Schulz. Name Gender Age Years with agency 17 7 3 21 15 7 26 18 1 5 8 14 1 9 2 31 1 14 21 4 11 12 10 13 22 20 5 23 2 10 18 34 13 9 3 10 21 1 23 13 12 3 1 23 Country where based Thailand Ivory Coast Thailand Chile Thailand Chile Cambodia Vietnam Costa Rica Brazil Nigeria Cambodia Kenya Brazil Thailand Ivory Coast Kenya Costa Rica Thailand Brazil Thailand Brazil Brazil Vietnam Brazil Cambodia Ivory Coast Vietnam Cambodia Chile Brazil Ivory Coast Brazil Costa Rica Costa Rica Vietnam Brazil Chile Ivory Coast Brazil Brazil Costa Rica Costa Rica Ivory Coast Medical training 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 Petit. Kevin Pourrier. Nicolas Rossi-Ferrari. Céline Schneider. Pascale Saphores. Arnaud Savinas. Benjamin Sib. Irena Six. Mistoura Sambe. Damien Rocourt. Mamadou Sanvee. Grégory Roux. Hubert Ruzibiza. Philippe Ranjatoelina. Khalid Saint-Quentin. Alexandra Servage. Tamara Schadt. Mohamed Savall. Brigitte Sinistaj. Oriane Sadki.

Marion Vorillon. Wenjie SuperVielle Brouques. Claire Villet. Claire Tahraoui. Name Gender Age Years with agency 7 22 2 4 7 1 8 3 18 8 19 1 2 17 5 17 8 2 4 10 2 18 13 2 30 6 15 1 13 18 26 3 9 Country where based Nigeria Vietnam Brazil Kenya Kenya Thailand Thailand Costa Rica Ivory Coast Kenya Thailand Brazil Ivory Coast Chile Brazil Costa Rica Nigeria Ivory Coast Chile Vietnam Vietnam Brazil Ivory Coast Kenya Brazil Kenya Chile Costa Rica Brazil Brazil Vietnam Ivory Coast Thailand Medical training 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 Souchko. Samy Wernert. Jessica Weigel. Diana Vincent. Leila Zeng. Laure Tillier. Imelda Wallays. Ning Yuan. Christophe Vande-Vyre. Julien Villemur. Anna Straub. Kémi Triquere. Elodie Sun. Li Zhao. Lucile Willot. Debora Xheko.Chapter 6: Theory and methods of statistical sampling 227 No. Zhiyi Zairi. Anne Wang. Fabrice Wadagni. Pauline Trenou. Lizhu M F F F F F F M F F M M F M M F F F M F F F M F M M F F F M F F F 38 52 25 31 40 31 33 49 39 29 44 23 40 55 25 41 33 36 32 45 30 38 34 24 52 40 46 28 48 39 51 25 33 Radiographer Nurse Nurse Physiotherapy Surgeon Nurse Nurse Nurse Nurse Nurse Nurse Nurse Nurse Physiotherapy Nurse Nurse Nurse General medicine Nurse Nurse Nurse Nurse Nurse Nurse Nurse Nurse Nurse Nurse Physiotherapy Surgeon Radiographer Nurse General medicine . Kadiatou Tarate. Davina Tall. Cyril Tshitungi. Mathieu Wlodyka. Romain Tessaro. Eni Xu. Sébastien Wurm. Mesenga Vadivelou. Edouard Soumare.

This page intentionally left blank .

000 people in each of the 10 indicated countries. one of the 25-member states.. Cyprus. Wall Street Journal Europe. and Greece are opposed to membership. Austria has not forgotten fighting back the invading Ottoman armies in the 16th and 17th centuries. This agreement came only after a tense night-and-day discussion with Austria. after a very heated debate.. which is the essence of the material in this chapter. This estimated information is based on a survey response of a sample of about 1. Germany.1 where an estimated 70% or more of the population in each of Austria. M. France. This survey was made to estimate population characteristics.Estimating population characteristics 7 Turkey and the margin of error The European Union. M. “Turkey gains EU approval to begin membership talks”. who strongly opposed Turkey’s membership. The survey was conducted in the period May–June 2005 with an indicated margin of error of 3% points. 4 October 2005. 1 and 14. pp.1 1 Champion. Reservations to Turkey’s membership is also very strong in other countries as shown in Figure 7. a Muslim country of 70 million people. and Karnitschnig. agreed in early October 2005 to open membership talks to admit Turkey. .

Austria Cyprus France Germany Greece Italy Poland Sweden Turkey UK 0% 10% 20% 30% 40% 50% 60% 3% Against 70% 80% 90% 100% Margin of error is In favor Undecided .1 Survey response of attitudes to Turkey joining the European Union.230 Statistics for Business Figure 7.

Point estimates In estimating. By its very nature. To facilitate comprehension the chapter is organized as follows: ✔ ✔ ✔ ✔ ✔ Estimating the mean value • Point estimates • Interval estimates • Confidence level and reliability • Confidence interval of the mean for an infinite population • Application of confidence intervals for an infinite population: Paper • Sample size for estimating the mean of an infinite population • Application for determining the sample size: Coffee • Confidence interval of the mean for a finite population • Application of the confidence interval for a finite population: Printing Estimating the mean using the Student-t distribution • The Student-t distribution • Degrees of freedom in the t-distribution • Profile of the Student-t distribution • Confidence intervals using a Student-t distribution • Excel and the Student-t distribution • Application of the Student-t distribution: Kiwi fruit • Sample size and the Student-t distribution • Re-look at the example kiwi fruit using the normal distribution Estimating and auditing • Estimating the population amount • Application of auditing for an infinite population: tee-shirts • Application of auditing for a finite population: paperback books Estimating the proportion • Interval estimate of the proportion for large samples • Sample size for the proportion for large samples • Application of estimation for proportions: Circuit boards Margin of error and levels of confidence • Explaining margin of error • Confidence levels In Chapter 6. μx.75. the proportion of the population expected to vote Republican.45. The problem with one value or a point estimate is that they are presented as being exact and that unless we have a super Estimating the Mean Value The mean or average value of data is the sum of all the data taken divided by the number of . measurements taken. Thus from samples we might with confidence estimate the mean weight of airplane passengers for fuel-loading purposes. we might select at random 20 items of inventory from a distribution centre and calculate that their average value is £25. However. For example. or infer. volume. The units of measurement can be financial units. we discussed statistical sampling for the purpose of obtaining information about a population. or the mean value of inventory in a distribution centre. we could use a single value to estimate the true population mean. population parameters based entirely on the sample data.75 then we might estimate that the population average of all students is also 3. if the sample experiment is correctly designed then there should be a reasonable confidence about conclusions that are made. In this case we would estimate that the population average of the entire inventory is £25. etc. weight.45. This chapter expands upon this to use sampling to estimate. Or. estimating is probabilistic as there is no certainty of the result. Here – we have used the sample mean x as a point estimate or an unbiased estimate of the true population mean.Chapter 7: Estimating population characteristics 231 Learning objectives After you have studied this chapter you will understand how sampling can be extended to make estimates of population parameters such as the mean and the proportion. if the grade point average of a random sample of students is 3. length.

the probability of them being precisely the right value is low. the mean life of the compressors is about 72 months and there is a 68. the client needs information about the life of compressors since the compressor is the principal working component of the refrigerator.25 73. z or 1 x 72 1. However. information says nothing about the reliability or confidence that we have in the estimate.25 or 2.500 units in the first year and I am 90% confidence of these figures. or it means that z 1.26% (about 68%) probability that the mean value will be between 70. Two standard errors of the mean.232 Statistics for Business crystal ball. The estimate for the sales of the new products is between 22. Again from .25 months. The estimate for the project cost is between $11.25 x µx σx / n Confidence level and reliability Suppose a subcontractor A makes refrigerator compressors for client B who assembles the final refrigerators. 72 1. Using the concept of point estimates we could say that the mean life of all the compressors manufactured is 72 months. – Here x is the estimator of the population mean μx and 72 months is the estimate of the population mean obtained from the sample. The subcontractor has been making these compressors for a long time and knows from past data that the standard deviation of the working life of compressors is 15 months. the standard error of the mean can be calculated by using equation 6(ii) from Chapter 6 from the central limit theory: σx σx n 15 144 15 12 1. From equation 6(iv). is determined to be 6 years or 72 months.25 months is one standard error of the mean.50 months.26% of all values in the distribution lie within 1 standard deviations from the mean. In practice it is more meaningful to have an interval estimate and to quantify these intervals by probability levels that give an estimate of the error in the measurement.25 months Interval estimates With an interval estimate we might describe situations as follows. they are either right or wrong.25 months Thus we can say that.25 70. is 2 * 1. Then since our sample size of 144 is large enough. Assume that a random sample of 144 compressors is tested and that the – mean life of the compressors. The estimate of the price of a certain stock is between $75 and $90 but I am only 50% confident of this information. If we assume that the life of a compressor follows a normal distribution then we know from Chapter 5 that 68.9 million and I am 95% confident of these figures. This value of 1. x When z x 72 1. The estimate of class enrolment for Business Statistics next academic year is between 220 and 260 students though I am not too confident about these figures. Point estimates are often inadequate as they are just a single value and thus. Thus the interval estimate is a range within which the population parameter is likely to fall. for the sampling distribution. this When z 1 then the lower limit of the compressor life is.8 and $12. In order to establish the terms of the final customer warranty.00. or when z 2.75 months 1 then the upper limit is.75 and 73.000 and 24. x.

It is important to note that as our confidence level increases. Finally.73% (about 100%) probability that the mean value will be between 68. going from a range of 2. 233 2 then the upper limit is. 3.75 months Thus we can say that the mean life of the compressor is about 72 months and there is almost a 99.75 to 73.25 months.75 months. 72 3 * 1.50 months is in the range 69. about 95% of them would include the true population mean. 2. the lower limit of compressor life is. of all values in the distribution lie within 2 standard deviations from the mean. three standard errors of the mean is 3 * 1. This is to be expected as we become more confident of our estimate.75 months and again from Chapter 5.3. The best estimate is that the mean compressor life is 72 months and the manufacturer is about 68% confident that the compressor life is in the range 70. or a range of 7. The best estimate is that the mean compressor life is 72 months and the manufacturer is about 95% confident that the compressor life . of all values in the distribution lie within 3 standard deviations from the mean.00 months.25 to 75. From the above compressor example. and 5 contain the population mean μ. x When z x 72 3 * 1. considering the 2σ confidence intervals. about 5% of them would not.25 and 75. the lower limit of the compressor life is. Here the confidence interval is between 68. As we have shown in the compressor situation.50 months Thus we can say that.75 and 73.50 months as the respective lower and upper limits.25 months Confidence interval of the mean for an infinite population The confidence interval is the range of the estimate being made. x When z x 72 2 * 1. the area in each tail is α/2 as shown in Figure 7.75 months. Since the distribution is symmetrical.44% (about 95%) probability that the mean value will be between 69.25 or 3.25 and 75. The 2σ intervals for sample numbers 1.50 months.50 and 74.50 months. where α is the total proportion in the tails of the distribution outside of the confidence interval.25 69.50 months.50 and 74. 4.25 74.2 for six different samples.50 and 74.73%. When z 3 then using equation 6(iv). μ. or a range of 5. A 95% confidence interval estimate implies that if all possible samples were taken.44% of the area under the normal curve. When z 2 then using equation 6(iv). whereas for samples 3 and 6 do not contain the population mean μ within their interval. the confidence interval increases. or about 95%. 95.25 months. The level of confidence is (1 α). The best estimate is that the mean compressor life is 72 months and the manufacturer is about 100% confident that the compressor life is in the range 68. This concept is illustrated in Figure 7.50 months. Thus in summary we say as follows: 1.Chapter 7: Estimating population characteristics Chapter 5. 2. Between these limits this is equivalent to 95.50 months.75 months. somewhere within their interval. or a range of 2.50 months. we give a broader range to cover uncertainties. Here the confidence interval is between 70. we have 69. the mean life of the compressor is about 72 months and there is a 95.44%. whereas. assuming a normal distribution. if we assume a normal distribution. Here the confidence interval is between 69.50 to 7. 72 2 * 1.25 75.50 to 74. going from 68% to 100%. 99. the 3 then the upper limit is.25 68.

5 Figure 7. x z σx n μx x z σx n 7(ii) Application of confidence intervals for an infinite population: Paper Inacopia. This implies that the population mean lies in the range given by the relationship.9986 cm.00 cm and it is known that the standard deviation of the cutting machine is 0. x z σx x z σx n 7(i) .2 Confidence interval estimate. Determine the 95% confidence intervals of the mean width of all the A4 paper coming off the production line? 2 Mean Confidence interval 2 confidence intervals for the population estimate for the mean value are thus. the Portuguese manufacturer of A4 paper commonly used in computer printers wants to be sure that its cutting machine is operating correctly. 6 2sx Interval for sample No. 4 2sx Interval for sample No. The width of A4 paper is expected to be 21. 2 x2 x6 x5 x4 2sx Interval for sample No. 1.234 Statistics for Business Figure 7. The quality control inspector pulls a random sample of 60 sheets from the production line and the average width of this sample is 20. 1 x1 2sx Interval for sample No. 2sx Interval for sample No.0100 cm.3 Confidence interval and the area in the tails. 3 x3 x1 x2 m x5 x4 x6 x3 2sx Interval for sample No.

9600. is – – (x μx) or μx x on the left side of the distri– μ on the right side of the bution.9986 1.9953–21. Using [function NORMSINV] in Excel for a value P(x) of 2. Since this interval contains the population expected mean value of 21. x .0013 From equation 7(i) the confidence limits are. 20. If the sample size is large there may be only a marginal gain in reliability in the estimate of our population mean but what is certain is that the analytical experiment will be more expensive.9600 * 0.9986 cm and we are 95% confident that the width is in the range 20. If the sample size is small.0013 and 20.50% 95%) which is the area of the curve from the left to the upper value of z. is 20. n.9986 cm Population standard deviation.9953 235 21.0000 cm.9986 2.0019. Sample size for estimating the mean of an infinite population In sampling it is useful to know the size of the sample to take in order to estimate the population parameter for a given confidence level.50% (2.5758.0011.0013 and 20.0100 Standard error of the mean is. Since the distribution is symmetrical.5%.5758. σ n 0. σ. Thus.) From equation 7(i) the confidence limits are. (Note: an alternative way of finding the upper value of z is to enter in [function NORMSINV] the value of 97.0100 60 0. the upper value is 2. is 60 – Sample mean.5% gives a lower value of z of 1.0011 cm Thus we would say that our best estimate of the width of the computer paper is 20. The area in each tail for a 99% confidence limit is 0. we can conclude that there seems to be no problem with the cutting machine. by equation 6(iv) or. on the left side of the distribution when z is negative.9600.9961–21.5%. (Note: an alternative way of finding the upper value of z is to enter in [function NORMSINV] the value of 99. to take for a given confidence level? The confidence limits are related the sample size. since this interval contains the expected mean value of 21. Since the distribution is symmetrical. Using [function NORMSINV] in Excel for a value P(x) of 0. is 0.0000 cm. n.0019 cm The area in the each tail for a 95% confidence limit is 2.5758 * 0.50% 99%) which is the area of the curve from the left to the upper value of z. Again. what is an appropriate sample size.9961 Thus we would say that our best estimate of the width of the computer paper is 20.50% (0.Chapter 7: Estimating population characteristics We have the following information: Sample size. we can conclude that there seems to be no problem with the cutting machine. We have to accept that unless the whole population is analysed there will always be a sampling error. n.) σx / n The range from the population mean.9986 cm and we are 99% confident that the width is in the range 20. 2. z x µx 6(iv) 21. the upper value is numerically the same at 1. Note that the limits in Question 2 are wider than in Question 1 since we have a higher confidence level. Determine the 99% confidence intervals of the mean width of all the A4 paper coming off the production line. and x x . the chances are that the error will be high.5% gives a lower value of z of 2. 20.

These include printing errors of colour and alignment. x μx. The . the subject gives.9600. the upper value is numerically the same at 1. n ⎛ zσx ⎞2 ⎟ ⎜ ⎟ ⎜ ⎟ ⎜ e ⎟ ⎝ ⎠ 7(iv) Thus for a given confidence level. and a given confidence limit the required sample size can be determined. the quality control inspector looks at 45 random pages selected from the book and finds that the average number of errors in these pages is 2. which then gives the value of z. if the population is considered finite.50 ⎠ ⎝ 61.50% 95%) which is the area of the curve from the left to the upper value of z. The following worked example illustrates the concept of confidence intervals and sample size for an infinite population.236 Statistics for Business distribution curve. z is 1. n ⎛ zσx ⎞2 ⎟ ⎜ ⎟ ⎜ ⎟ ⎜ e ⎟ ⎝ ⎠ In this case the confidence limits for the population estimation from equation 7(i) are modified as follows: σx n x z σx x z (N (N 1) n) 7(v) The area in the each tail for a 95% confidence limit is 2. Since the distribution is symmetrical.9600. but also typing errors which originate from the author and the editor.50 g. Note in equation 7(iv) since n is given by squared value it does not matter if we use a negative or positive value for z.50 g n ⎛ 1. Reorganizing equation 6(iv) by making the sample size.463 u 62 (rounded up) Thus the quality control inspector should take a sample size of 62 (61 would be just slightly too small). Confidence interval of the mean for a finite population As discussed in Chapter 6 (equation 6(vi)).5%. – The term. After the book is printed. What sample size should the inspector take to be 95% confidence of the estimate? Using equation 7(iv). is the sample error and if we denote this by e. then the sample size is given by. (Note: an alternative way of finding the upper value of z is to enter in [function NORMSINV] the value of 97. n. then the standard error should be modified by the finite population multiplier according to the expression. 1. n ⎛ zσ ⎞2 ⎜ x ⎟ ⎟ ⎜ ⎜x μ ⎟ ⎟ ⎜ ⎝ x⎠ 7(iii) Here.9600 * 2.00 ⎞2 ⎟ ⎜ ⎟ ⎜ ⎟ ⎜ ⎟ 0.50% (2. Using [function NORMSINV] in Excel for a value P(x) of 2.) Application of the confidence interval for a finite population: Printing A printing firm runs off the first edition of a textbook of 496 pages.9600 (it does not matter whether we use plus or minus since we square the value) σx is 2 g e is 0.5% gives a lower value of z of 1.70. It is known that the standard deviation of the coffee filling machine is 2 g. that is the ratio n/N is greater than 5%. σx σx n ⋅ N N n 1 6(vi) Application for determining sample size: Coffee The quality control inspector of the filling machine for coffee wants to estimate the mean weight of coffee in its 200 gram jars to within 0.

from which the Student-t Uncorrected standard error of the mean is σx n σx n 0.5. we must use the finite population multiplier: N N n 1 496 45 496 1 451 495 0. Thus.5 45 0.84 Thus we could say that the best estimate of the errors in the book is 2. we are obliged to increase our sampling size to at least 30 units in order that the sampling distribution of the means will be approximately normally distributed. The Student-t distributions are a family of distributions each one having a different shape and characterized by a parameter called the degrees of freedom.70 per page and that we are 95% confident that the errors lie between 2.70 Population standard deviation. the upper value is numerically the same at 1.0711 Confidence level is 95%.84 errors per page. The Student-t distribution In Chapter 6. any sample size will give a sampling distribution of the means that are approximately normal. Since the distribution is symmetrical. if we sample from populations that are not normal. is 45 Population size. errors per page is 2. such that we know the standard deviation. What is a 95% confidence interval for the mean number of errors in the book? Sample size. is 496 – Sample mean. σ. σx N N n 1 0.5% gives a lower value of z of 1.70 1. x . 2.5 Ratio of n/N is 45/496 9. σ.0745 * 0.9600. what do we do when we have small sample sizes that are less than 30 units? To be correct.Chapter 7: Estimating population characteristics inspector knows that based on past contracts for a first edition of a book the standard deviation of the number of errors per page is 0.0745 Corrected standard error of the mean.9600 * 0. It was developed by William Gossett of the Guinness Brewery. 1. or more simply the t-distribution. “Sample size and shape of the sampling distribution of the means”.56 Thus from equation 7(v) the upper confidence limit is. thus area in each tail is 2. . is a continuous distribution for small amounts of data.9545 237 Estimating the Mean Using the Student-t Distribution There may be situations in estimating when we do not know the population standard deviation and that we have small sample sizes.9600.9545 0. 2. In this case there is an alternative distribution that we apply called the Student-t distribution. If we sample from population distributions that are normal.0711 2. in Dublin. thus.0711 2. we should use a Student-t distribution. Thus from equation 7(v) the lower confidence limit is. n.70 1.5% Using [function NORMSINV] in Excel for a value P(x) of 2. The Student-t distribution. in the paragraph entitled.07% This value is greater than 5%. is 0. we indicated that the sample size taken has an influence on the shape of the sampling distributions of the means. The density function. like the normal distribution. Ireland in 1908 (presumably when he had time between beer production!) and published under the pseudonym “student” as the Guinness company would not allow him to put his own name to the development. However.56 and 2.9600 * 0. N.

the degrees of freedom means the choices that you have regarding taking certain actions. Here. For example. assume that we give v. z. In general terms. π is the value of 3. 11. compared to the Student-t distribution. using (n 1) are respectively 5. is fixed at a value of 5 in order to retain the validity of the equation. has the following relationship: f (t) ⎡( υ 1) / 2⎤ ! ⎣⎢ ⎦⎥ ⎡( υ 2) / 2⎤ υπ ⎢ ⎥⎦ ⎣ ⎡ ⎢1 ⎢ ⎣ t2 ⎤⎥ υ ⎥⎦ ( υ 1)/2 7(vi) Thus automatically the fifth variable. This then implies that there is a degree of freedom for every sample size. for sample size n of 6. and y the values 14. and 22.4. The degrees of freedom for these curves. This is the penalty you pay for small sample sizes and where the sampling is taken from a non-normal population. 12.1416. the degree of freedom is the value determined by (n 1).5. respectively. The Student-t distribution is flatter and you have to go further out on either side of the mean value before you are close to the x-axis indicating greater variability in the sample data. and z that are related by the following equation: v w x 5 y z 13 7(vii) Since there are five variables we have a choice. and 18. and 21. υ is the degree of freedom. x. and t is the value on the x-axis similar to the z-value of a normal distribution. the value of the fifth variable is automatically fixed. w. Here we had five variables to give a degree of freedom of four. to select four of the five. To understand quantitatively the degrees of freedom consider the following. or sample sizes less than 30. we see that the normal distribution is higher at the peak and the tails are closer to the x-axis. 16. for a sample size of n units. 14 z 16 12 18 5 z 13 Confidence intervals using a Student-t distribution When we have a normal distribution the confidence intervals of estimating the mean value of the population are as given in equation 7(i): x z σx n 7(i) 5 * 13 (14 16 12 18) 65 60 5 . are illustrated in Figure 7.4 the curve for a sample size of 22 has a smaller variation and is higher at the peak. For example. These three curves have a profile similar to the normal distribution but if we superimposed a normal distribution on a Student-t distribution as shown in Figure 7. Profile of the Student-t distribution Three Student-t distributions. w. There are five variables v. what is the degree of freedom that you have in manoeuvring your car into a parking slot? What is the degree of freedom that you have in contract or price negotiations? What is the degree of freedom that you have in negotiating a black run on the ski slopes? In the context of statistics the degrees of freedom in a Student-t distribution are given by (n 1) where n is the sample size. y.238 Statistics for Business distribution is drawn. or the degree of freedom. 12. Degrees of freedom in the Student-t distribution Literally. Then from equation 7(vii) we have. After that. As the sample size increases the profile of the Student-t distribution approaches that of the normal distribution and as is illustrated in Figure 7. x.

However. ˆ σ s Σ(x (n x )2 1) 7(ix) Student-t distribution We could avoid writing σ. Equation 7(i) is modified to give the following: x t ˆ σx n 7(viii) Normal distribution Here the value of t has replaced z. This new term. means an estimate of the population ˆ standard deviation. the sample standard deviation by the relationship. the population standard deviation.4 Three Student-t distributions for different sample sizes. by putting σ it is clear that our ˆ only alternative to estimate our confidence . and ˆ simply write s since they are numerically the same. as some texts do. and σ has ˆ replaced σ. σ. Numerically it is equal to s.Chapter 7: Estimating population characteristics 239 Figure 7.5 Normal and Student-t distributions. Mean value Sample size 6 Sample size 12 Sample size 22 Figure 7. When we are using a Student-t distribution.

x 100. x 100. we use a Student-t distribution.5346 3 5.5346 Application of the Student-t distribution: Kiwi fruit Sheila Hope.5346 3 5. the number of tails is always two – that is. ˆ σx n 12. is 25. Table 7. is 100.2312 95. Area outside of confidence interval. (n 1).00 2. (This is not necessarily the case for hypothesis testing that is discussed in Chapter 8. (100% 95%) is 5%. n. one on the left and one on the right. Using [function TINV].01 mg and the upper level . which determines the probability or area for a given random variable x. Required confidence level (given) is 95%.0639.0639 * 2.6731. s. Milligrams of vitamins per kiwi Excel and the Student-t distribution There are two functions in Excel for the Student-t distribution.1 sampled. x . Degrees of freedom. σ s 12. Using [function STDEV]. (Note the difference in the way you enter the variables for the Student-t and the normal distribution. is 5.47 Thus the estimate of the average level of vitamin C in all the imported kiwis is 100. mean value of – the sample. Estimate the average level of vitamin C in the imported kiwi fruits and give a 95% confidence level of this estimate. Table 7. Using [function SQRT].) The other function is [function TINV] and this determines the value of the Student-t under the distribution given the total area outside the curve or α. is 24. α.240 Statistics for Business limits is to use an estimate of the population standard deviation as measured from the sample. ˆ Standard error of the sample distribution. and the sample size of 25 is less than 30.1 gives the results in milligrams per kiwi sampled.24 t ˆ σx n 2. One is [function TDIST].6731 5.00. in order to compare this information with kiwi fruits grown in the Central Valley. From equation 7(viii). standard deviation of the sample.2312 105.01 t ˆ σx Upper confidence level. California wants to know in milligrams.0639 * 2. the degree of freedom. Estimate of the population standard deviation. Since we have no information about the population standard deviation.24. California. When we use the t-distribution in estimating.24 n 2. For the Student-t you enter the area in the tails. n.6731.24 100.) 109 101 114 97 83 88 89 106 89 79 91 97 94 117 107 136 115 109 105 100 93 92 110 92 93 Using [function AVERAGE].24 mg with a 95% confidence that the lower level of our estimate is 95.24 100. 1. Lower confidence level. the level of vitamin C in a boat load of kiwi fruits imported from New Zealand. whereas for the normal distribution you enter the area of the curve from the extreme left to a value on the x-axis. Student-t value is 2. the Agricultural inspector at Los Angeles. Sample size. square root of the sample size. is 12. Sheila took a random sample of 25 kiwis from the ship’s hold and measured the vitamin C content. and the number of tails in the distribution.

50% 105.35% 3.0010 1.9830 1.9723 1.0049 2.9600 1.9600 1. a sample size of 30 or a sample size of 120? The movement of the value of t relative to the value of z is illustrated by the data in Table 7.9745 1.9729 1.47 Values of t and z with different sample sizes.68% 0.9600 1.9600 1.9778 1.20% 2.7765 2.66% 1.79% 5.47 mg.9600 1.2 95.9600 1. What should we use.24 mg Confidence interval 2. Sample size and the Student-t distribution We have said that the Student-t distribution should be used when the sample size is less than 30 and the population standard deviation is unknown.2 and the corresponding graph in Table 7.9600 1.64% 0.00% 2.9801 1.9600 1.9600 1.9600 1.29% 2.0930 2. 95.9793 1.9750 1.9600 1.9600 1.2622 2.0322 2.18% 1.0154 2.9600 1.9755 1.9600 1.09% 1.83% 2.1448 2.9925 1.9600 (t z z) 41.9600 1.12% 1.9600 1.9905 1.9600 1.88% 0.74% 0.93% 1.0227 2.0639 2. 241 Figure 7.9886 1.9949 1.9600 1.42% 9.9870 1.9820 1.9760 1.24% 1.61% .9600 1.9741 1.Chapter 7: Estimating population characteristics is 105.9977 1.00% 2.6 Confidence intervals for kiwi fruit.66% 15.38% 1.9600 1.85% 0.9600 1.46% 1.01 100.0096 2. This information is illustrated on the Student-t distribution in Figure 7.50% 97.9733 1.72% 0.9600 1.9600 1.91% 0.56% 1.00% 5.78% 1.9600 1.9600 1.50% Upper z 1.9600 1.9600 Confidence level Area outside Excel (lower) Excel (upper) Sample size.9737 1.9600 1.9600 1.30% 1.0452 2.77% 0.9600 1. n 5 10 15 20 25 30 35 40 45 50 55 60 65 70 75 80 85 90 95 100 Upper Student-t 2.9842 (t z z) Sample size.9600 1.6. Some analysts are more rigid and use a sample size of 120 as the cut-off point.9855 1.9785 1.9600 1.9720 Upper z 1.79% 0.53% 2.07% 1.63% 0.9766 1.9600 1.03% 0. n 105 110 115 120 125 130 135 140 145 150 155 160 165 170 175 180 185 190 195 200 Upper Student-t 1.9600 1.82% 0.99% 0.69% 3.9600 1.9600 1.9600 1.43% 6.9600 1.50% 95.70% 0.9600 1.9772 1.9810 1.30% 4.66% 0.95% 0.9726 1.

2500 2.0500 2.7500 2. In the column (t z)/z we see that the difference between t and z is 4.6000 2.5000 2. Re-look at the example Kiwi fruit using the normal distribution Here all the provided data and the calculations are the same as previously but we are going to assume that we can use the normal distribution for our analysis.8000 2.2000 2. When the sample size increases to 120 then the difference is just 1.35% for a sample size of 30.1000 2.3500 2.5% in both tails for a symmetrical . Let Upper z or t value us take another look at the kiwi fruit example from above using z rather than t values. (100% 95%) is 5%.5500 2.8500 2. Here we have the Student-t value for a confidence level of 95% for sample sizes ranging from 5 to 200. We have to remember that we are making estimates so that we must expect errors. The value of z is also shown and this is constant at the 95% confidence level since z is not a function of sample size.03%.6500 2. 2.242 Statistics for Business Figure 7. n Upper t value Upper z value Figure 7. Is this difference significant? It really depends on what you are sampling. In the medical field small differences may be important but in the business world perhaps less so.7000 2.0000 1. Required confidence level (given) is 95%. α.3000 2.7 As the sample size increases the value of t approaches z.9000 0 10 20 30 40 50 60 70 80 90 100 110 120 130 140 150 160 170 180 190 200 Sample size.1500 2. Area outside of confidence interval.9500 1.4000 2.4500 2.7. which means that there is an area of 2.

9600.9678 It is unlikely we know the standard deviation of the large population of inventory and so we would estimate the value from the sample. are. inventory held in a distribution centre when.00 1.24 1. 100.9600 * 2. From equation 7(i). If N is the total number of units. The owner takes a random sample of 29 items and Table 7. tank tops. N. then the point estimate for the population total is the size of the population.Chapter 7: Estimating population characteristics distribution. for example. x z x n 100. the sample size is less than 30.9678 105.27 ˆ σ Upper confidence level. inventory items. or. Confidence intervals: N x Nt ˆ σ n 243 7(xi) Alternatively. x z ˆ σx n 100. Since in reality we would report probability of our confidence for the vitamin level of kiwis between 95 and 105 mg. The inventory records indicate that there are 4.21 The corresponding values that we obtained by using the Student-t distribution were 95. it is impossible or very time consuming to make an audit of the population. and sweaters that it has in its store.47 or a difference of only some 0. the difference between using z and t in this case in insignificant.01 and 105.3%.5346 0 95. then the standard error has to be modified by the estimated finite population multiplier to give.5346 0 100. Application of auditing for an infinite population: tee-shirts A store on Duval Street in Key West Florida.500 of these clothing articles on the shelves. In this case we first take a random and representative sample and deter– mine the mean financial value x .24 4.24 4. Lower confidence level. – Total N x 7(x) The following two applications illustrate the use of estimating the total population amount for auditing purposes. for example. Thus the confidence intervals when the standard deviation is unknown. and the population is finite. Using [function NORMSINV] in Excel for a value P(x) of 2.3 gives the prices in dollars indicated on the articles. Confidence intervals: ˆ σ N n N x Nt n N 1 7(xiii) Estimating the population amount We can use the concepts that we have developed in this chapter to estimate the total value of goods such as. If the sample size is less than 30 we use the Student-t distribution and the confidence intervals are given as follows by multiplying both terms in equation 7(viii) to give. multiplied by the sample mean. Estimated standard error: ˆ σ n N N n 7(xii) 1 Estimating and Auditing Auditing is the methodical examination of financial accounts. wishes to estimate the total retail value of its tee-shirts. if the population is considered finite. that is the ratio of n/N 5%. . or operating processes to verify that they confirm with standard practices or targeted budget levels.9600 * 2.5% the value of z is 1.

Sample standard deviation.00 32. Mean retail value of books is £4.50 18.00 24. is 2.50 17. Sample size. From equation 7(xii) the estimated standard error is. There are 12 shelves of books and the owner estimates that there are 45 books per shelf. Degrees of freedom (n 1) is 28.50 27.0836 / 29.50 21. Estimate the total retail value of the books within a 95% confidence limit.303.500 – Estimated total retail value is N x 4.00 22.0582.500 * 25.50 42. Estimated population standard deviation.50 18. and the sample size is less than 30 we use the Student-t distribution.0976 . Using Excel [function TINV] for a 99% confidence level.00 52. is $11. – Estimated total retail value is N x 540 * 4.55. Nx Nt ˆ σ n 4.489 rounded). ˆ σ n N N n 1 0.00 15. is 12 * 45 or 540.00 44. is 28.244 Statistics for Business Thus the owner estimates the average.304 rounded) and $139.50 23. N. 1.55 Application of auditing for a finite population: paperback books A newspaper and bookstore at Waterloo Station wants to estimate the value of paper backed books it has in its store.33 (say $139.50 1. 16. 896.897 (rounded) and he is 99% confident that the value lies between $88.57 with a sample standard deviation of 53 pence. is $25.0582 or $139. x . N.57.78 (say $88.00 11. Sample standard deviation. Since we do not know the population standard deviation.50 16. 500 * 2.7633 * 2.00 20. ˆ σ N x Nt $113. From equation 7(xi) the lower confidence limit for the total value is. Estimated population standard deviation. Since this value is less than 5% we do not need to use the finite population multiplier.78 and the upper confidence limit is..489.00 9.19%.00 25. is £0.50 19. or point estimate. σ.00 37.31. Since this value is greater than 5% we use the finite population multiplier Finite population multiplier 540 28 540 1 512 539 N N n 1 0. 500 * 2.50 21. n. 489.00 29. ˆ is £0. Estimated standard error of the sample ˆ distribution.31 or $113.00 50.80.00 12.0836.33 $113.0836. 896. Estimate the total retail value of the clothing items within a 99% confidence limit.50 21.53 28 * 0. Population size.0582 or $88.9746. Ratio n/N is 29/4. Student-t value is 2.9746 0. s. σ. Table 7. n.00 32.50 29.57 or £2. 303.500 or 0.64%.55 n 4.467.50 25. of the total retail value of the clothing items in his Key West store as $113. Estimated population amount of books.7633.53. is 4. Using Excel [function AVERAGE] the sample – mean value. The owner takes a random sample of 28 books and determines that the average retail value is £4.896. s.53. ˆ is $11. Sample size. Ratio n/N is 28/540 or 5.3 Tee shirts – prices in $US.7633 * 2. is 29. σx / n 11.

we take a sample and say that our point estimate of the proportion expected to vote conservative in the next United Kingdom election is 37% and that we are 90% confident that the proportion will be in the range of 34% and 40%. we might be interested to estimate the proportion in the population.0518. The value – is p determined by taking a sample of size n and measuring the proportion of successes.80 (£2.64 From equation 7(xiii) the upper confidence limit is.467. Nx Nt ˆ σ N N n 1 540 * 2.576 rounded). is a point estimate of p the population proportion p. If we do not know the population proportion. then the standard error of the proportion can be estimated from the following equation by replacing p with –: p ˆ σp p (1 p ) n 7(xvi) ˆ In this case. σ p is the estimated standard error of the proportion and – is the sample proportion p of successes.Chapter 7: Estimating population characteristics 245 Degrees of freedom (n 1) is 27. analogous to the estimation for the means. 575. 467.80 £2. If we do this then equation 7(xv) is modified to give the expression. Nx Nt ˆ σ N N n 1 540 * 2.0518 * 0. p. Student-t value is 2. σ p : σp pq n p(1 p) n 6(xi) n £2.64 (say £2.468 rounded) and that she is 95% confident that the value lies between £2.575. Using Excel [function TINV] for a 95% confidence level. 359. When dealing with proportions then the sample proportion. Thus.0976 where n is the sample size and p is the population proportion of successes and q is the population proportion of failures equal to (1 p). From equation 7(xiii) the lower confidence limit is. 467.96 (say £2. –. Further. z p p p(1 p) n Reorganizing this equation we have the following expression for the confidence intervals for the estimate of the population proportion as follows: p p z p(1 p) n 7(xiv) n £2. p z p (1 p ) n p p z p (1 p ) 7(xvii) n .359. or point estimate.360 rounded) and £2. this implies that the confidence intervals for an estimate of the population proportion lie in the range given by the following expression: p z p(1 p) n p p z p(1 p) n 7(xv) Estimating the Proportion Rather than making an estimate of the mean value of the population.0976 Interval estimate of the proportion for large samples When analysing the proportions of a population then from Chapter 6 we developed the following equation 6(xi) for the standard error of the proportion. For example.80 £2.96 Thus the owner estimates the average. of the total retail value of the paper back books in the store as £2.0518 * 0. from equation 6(xv).

From equation 7(xvi) the estimate of the standard error of the proportion is.20 0. we can determine the sample size to take in order to estimate the population proportion for a given confidence level. 0. 1.70 0.5 gives the maximum possible value of 0.80 0.10 0.0900 0.90 1.246 Statistics for Business Sample size for the proportion for large samples In a similar way for the mean. ˆ σp p (1 p ) n 0.40 0.00 (1 p) p(1 p) Squaring both sides of the equation we have. the sample size the subject of the equation gives.0076 When we have a 90% confidence interval.2400 0. What is a 90% confidence interval for the proportion of all the defective circuit boards produced in this manufacturing process? Proportion defective. we can use a value of p equal to 0.97.70 0.30 0.030.50 0.80 0. and assuming a normal distribution.20 0.1600 0.4 and illustrated by the graph in Figure 7.30 0.00 0. (p p)2 z2 p(1 p) n 1.1600 0.10 0.2100 0.60 0. and the required sample error.8. p Alternatively. then the area of the distribution up to the lower confidence level is (100% 90%)/2 5% . when this is actually the value that we are trying to estimate! One possible approach is to use the value of – if this is available.90 0. This is because for a given value of the confidence level say 95% which defines z.2400 0.00 0.0900 0. e.4 Conservative value of p for sample size.97 500 0. n z2 p(1 p) e2 7(xx) While using this equation.00 0. p p z p(1 p) n 7(xviii) Table 7.0000 0. This is shown in Table 7.03 * 0. is 15/500 0. n z2 p(1 p) ( p p)2 7(xix) Application of estimation for proportions: Circuit boards In the manufacture of electronic circuit boards a sample of 500 is taken from a production line and of these 15 are defective. (p p) by e then the sample size is given by the relationship. –. p.0000 Making n.25 in the numerator of equation 7(xx). p Proportion that is good is 1 500 15 500 0.50 0.5 or 50% as this will give the most conservative sample size. then a value of p of 0. The following is an application of the estimation for proportions including an estimation of the sample size.2100 0. a question arises as to what value to use for the true population proportion.60 0.2500 0. p 0.030 or also – If we denote the sample error. From the relationship of 7(xiv) the intervals for the estimate of the population proportion are.40 0.0291 500 0.

Product.15 0.3263.0076 0.0425 5 boards which are defective is 0.25 0.90 0. p(1 0.75 0.65 0. then what size of sample should we take? When we have a 98% confidence interval.80 0.2250 0.00 Proportion.1750 0.95 1.8 Relation of the product.σ p 0. and assuming a normal distribution. value of z at the area of 95% is 1.6449 * 0. p ˆ z. If we required our estimate of the proportion of all the defective manufactured circuit boards to be within a margin of error of 0.Chapter 7: Estimating population characteristics 247 Figure 7. From equation 7(xvii) the lower confidence limit is.1500 0.0175 7 From equation 7(xvii) the upper confidence limit is. p and the area of the curve up to the upper confidence level is 5% 90% 95%.0125 0.05 0.0750 0.03 0.1000 0.2500 0. From Excel [function NORMSINV].85 0. p.75% and 0. then the area of the distribution up to the lower confidence level is (100% 98%)/2 1% and the area of the curve up to the upper confidence level is 1% 98% 99%.40 0.0250 0.03 1.0076 0.25%.0425 or 4.1250 0.03 0.30 0.03 1.50 0. we are 90% confident that this proportion lies in the range of 0.10 0. the proportion of all the manufactured circuit .2000 0.03 or 3%.01 at a 98% confidence level.20 0.70 0.60 0.0000 p) with the proportion. 2. From Excel [function NORMSINV].00 0.6449 * 0.6449.35 0.0175 or 1. value of z at the area of 5% is 1.6449. From the Excel normal distribution function we have the following.45 0. Thus we can say that from our analysis. value of z at the area of 1% is 2.2750 0.0500 0. Further.0125 0. p(1 p) 0.55 0. p ˆ zσ p 0. From Excel [function NORMSINV].

there will be a margin of error. more conservative approach is to use a value of p 0.3263.500 is significantly higher than 1. This is not to say that we have made a calculation error. “Why don’t we always . Using equation 7(xx).3263 * 2. let us analyse a bigger sample in order to obtain a smaller margin of error. when we have a given standard deviation and a given confidence level the only term that can change is the sample size n. In this case the sample size to use is.03 * 0. Thus.575. n z2 p(1 p) e2 0. e.2500 0. If we double the sample size from 60 to 120 units the ratio of 1/ n changes from 12. but as can be seen from Figure 7.03. Now if we look at equation 7(xxi). although this can occur.13% or a difference 7(xxi) 2. Thus the sample size to estimate the population proportion of the number of defective circuits within an error of margin of error of 0. is 0. 2. but the margin of error measures the maximum amount that our estimate is expected to differ from the actual population parameter.3263 * 0. Increasing the sample size does reduce the margin of error but at a decreasing rate. value of z at the area of 99% is 2.9986 cm and we have a margin of error of 0. 500 This value of 2.248 Statistics for Business From Excel [function NORMSINV].50 0 0.0001 1.3263 or 2.01 * 0. Explaining margin of error When we analyse our sample we are trying to estimate the population parameter.9. since we are squaring z and the negative value becomes positive. there is a diminishing return.9600 * 0. another way of reporting our results is to say that we estimate that the width of all the computer paper from the production line is 20.0013 or 0. n z2 p(1 p) e2 0.575 and would certainly add to the cost of the sampling experiment with not necessarily a significant gain in the accuracy of the results.1575 0. Thus we might say.97 0 0.01 2. This is true.0025 cm at a 95% confidence. The sample proportion – is used for the popp ulation proportion p or 0.01 * 0.01. either the mean value or the proportion. which gives the ratio of 1/ n as a percentage according to the sample size in units. at a confidence level of 95%. Margin of Error and Levels of Confidence When we make estimates the question arises (or at least it should) “How good is your estimate?” That is to say.91% to 9. When we do this. we might ask. An alternative.50 * 0. the margin of error is 1.5.01 from the true proportion is 1.3263 * 2.3263.01 use a high confidence level of say 99% as this would signify a high degree of accuracy?” These two issues are related and are discussed below. If we are estimating the mean value then. The sample error.3263 * 0. what is the margin of error? In addition. In the worked example paper.0001 2.0025 cm. 575 It does not matter which value of z we use. Margin of error is z σx n This is the same as the confidence limits from equation 7(i). The margin of error is a plus or minus value added to the sample result that tells us how good is our estimate.

6449 Since for proportions we are trying to estimate the percentage for a situation then the margin of error is a plus or minus percentage.25% 1. If we look at Figure 7. The margin of error quoted in a sampling situation is important as it can give uncertainty to our conclusions.1.78%.50%.39%. In the This means that our estimate could be 1.13% to 7. 14 13 12 11 10 9 1/ n (%) 8 7 6 5 4 3 2 1 0 0 60 120 180 240 300 360 420 480 540 600 660 720 780 840 900 960 Sample size.68% or. With the increasing sample size the cost of testing of course increases and so there has to be a balance between the size of the sample and the cost. n units of 3.9 The change of 1/ n with increase of sample size. Based on just this information we might conclude that the majority of the Italians are against Turkey’s membership. if we then bring in the 3% margin of error then this means that we can .97 500 0.Chapter 7: Estimating population characteristics 249 Figure 7.03 * 0.25% less than our estimated proportion or a range of 2.25% more or 1.45% or a difference of 1. However.0125 1.27% to 4. for example. if we go from a sample size of 360 to 420 units the value of 1/ n goes from 5. From a sample size of 120 to 180 the value of 1/ n changes from 9. ˆ σp z p (1 p ) n 0.88% or a difference of only 0. we see that 52% of the Italian population is against Turkey joining the European Union. If we are estimating for proportions then the margin of error is from equation 7(xvii) the value. z p (1 p ) n 7(xxii) worked example circuit boards the margin of error at a 90% level of confidence is.

At 18 months there is a 50% confidence if there are.50 years have 49% against Turkey joining the Union (52 3). However this is not the case since in order to have high confidence levels we need to have large confidence intervals or a large margin of error. which is not now the majority of the population.250 Statistics for Business Table 7. the margin of error must be taken into account when surveys are made because the result could change. You are concerned about the time taken to complete the project and you ask the constructor various questions concerning the time frame. for example. At 6 months we are essentially saying it is impossible. Will my house be finished in 2 years? 4. for a house to be finished in 10 years the constructor is almost certain because this is an inordinate amount of time and so we have put a confidence level of 99%. Again to ask the question for 5 years the confidence level is high at 95%. In this case the large intervals give very broad or fuzzy estimates. (The time to completely construct a house varies with location but some 18 months to 2 years to build and completely finish all the landscaping is a reasonable time frame. give the implied confidence interval and the implied confidence level.) . These are given in the 1st column of Table 7. Assume that you have contracted a new house to be built of 170 m2 living space on 2. This can be illustrated qualitatively as follows. Constructor’s response I am certain I am pretty sure I think so Possibly Probably not Implied confidence interval 99% 95% 80% About 50% About 1% Implied confidence level 10 5 2 years years years Your question 1.5 Questions asked in house construction. Possible indicated responses to these are given in the 2nd column and the 3rd and 4th columns. If the margin of error was included in the survey result of the Dewey/Truman election race. respectively. Thus. as presented in the Box Opener of Chapter 6. Will my house be finished in 10 years? 2. At 2 years there is a confidence level of 80% if everything goes better than planned. the Chicago Tribune may not have been so quick to publish their morning paper! Confidence levels If we have a confidence level that is high say at 99% the immediate impression is to think that we have a high accuracy in our sampling and estimating process. Will my house be finished in 6 months? 1. Thus. Will my house be finished in 5 years? 3.5. Our conclusions are reversed and in cases like these we might hear the term for the media “the results are too close to call”.5 years 0. ways to expedite the work.500 m2 of land. Will my house be finished in 18 months? 5.

this can be calculated from the interval equation since the number of standard deviations. As we increase the size of the sample the value of the Student-t approaches the value z and so in this case we can use the normal distribution relationship. by the total value of the population. When our population is finite. The degree of freedom is the sample size less one. This is then multiplied by the number of standard deviations in order to determine the confidence intervals. A more objective analysis is to give a range of the estimate and the probability. to be correct we must use a Student-t distribution. we correct our standard error by multiplying by the finite population multiplier. Estimating the proportion If we are interested in making an estimate of the population proportion we first determine the standard error of the proportion by using the population value. and then multiply this by the number of standard deviations to give our confidence limits. The wider the confidence interval then the higher is our confidence and vice-versa. z. estimating the population proportion.Chapter 7: Estimating population characteristics 251 This chapter has covered estimating the mean value of a population using a normal distribution and a Student-t distribution. To do this we multiply both the average value obtained from our sample. When we do this in sampling from an infinite normal distribution we use the standard error. The Student-t distributions are a family of curves. Estimating the mean using the Student-t distribution When we have a sample size that is less than 30. The standard error is the population standard deviation divided by the square root of the sample size. If we have a finite population we must modify the standard error by the finite population multiplier. and we do not know the population standard deviation. and the confidence interval. is set by our level of confidence. Estimating and auditing The activity of estimating can be extended to auditing financial accounts or values of inventory. for a given confidence interval. that we have in this estimate. using estimating for auditing purposes. each one being a function of the degree of freedom. or the confidence. and discussed the margin of error and confidence intervals. Chapter Summary Estimating the mean value We can estimate the population mean by using the average value taken from a random sample. Since it is unlikely that we know the population standard deviation in our audit experiment we use a Student-t distribution and use the sample standard deviation in order to estimate our population standard deviation. If we wish to determine a required sample size. However this single value is often insufficient as it is either right or wrong. similar in profile to the normal distribution. This is a point estimate. If we do not have a value of the population . When we do not know the population standard deviation we must use the sample standard deviation as an estimate of the population standard deviation in order to calculate the confidence intervals.

as we increase the size of the sample the cost of our sampling experiment increases and there is a diminishing return on the margin of error with sample size. In order to have a high confidence level we need to have broader confidence limits and this leads to rather vague or fuzzy estimates.252 Statistics for Business proportion then we use the sample value of the proportion to estimate our standard error. The larger the sample size then the smaller is the margin of error. Although at first it might appear that a high confidence level of say close to 100% indicates a high level of accuracy. The most conservative sample size will be when the value of the proportion p has a value of 0. However.5 or 50%. . We can determine the sample size for a required confidence level by reorganizing the confidence level equation to make the sample size the subject of the equation. Margin of error and levels of confidence In estimating both the mean and the proportion of a population the margin of error is the maximum amount of difference between the value of the population and our estimated amount. this is not the case.

confidence intervals for the mean length of the life of light bulbs. Required 1. 5. About how many cases would have to be selected such that you would be within 2 g of the population mean value? 5. One of its production lines is for filling 500 g squeeze bottles. Ketchup Situation A firm manufactures and bottles tomato ketchup that it then sells to retail firms under a private label brand. 3. Light bulbs Situation A subsidiary of GE manufactures incandescent light bulbs. which after being filled are fed automatically into packing cases of 20 bottles per case. In the filling operation the firm knows that the standard deviation of the filling operation is 8 g. In a randomly selected case what would be the 99% confidence intervals for the mean weight of ketchup in a case? 3. Explain the differences between the answers to Questions 1 and 2. what would be the 95% confidence intervals for the mean weight of ketchup in a case? 2. 3. Ski magazine Situation The publisher of a ski magazine in France is interested to know something about the average annual income of the people who purchase their magazine. and 4. Over a period of . Determine the 99% confidence intervals for the mean length of the life of light bulbs. Explain the differences between Questions 1.Chapter 7: Estimating population characteristics 253 EXERCISE PROBLEMS 1. 342 426 317 545 264 451 1.049 631 512 266 492 562 298 Required 1. Determine the 90%. What are your comments about this sampling experiment from the point-of-view of randomness? 2. 4. 4. 2. The number of hours each burned before failure is given below. In a randomly selected case. How would you explain the concept illustrated by Question 1? 3. Determine the 80% confidence intervals for the mean length of the life of light bulbs. The manufacturer sampled 13 bulbs from a lot and burned them continuously until they failed.

000 $43. 3. Households Situation A random sample of 121 households indicated they spent on average £12 on take-away restaurant foods.845 and the standard deviation of this sample is €8. the Tax Commissioner took a random sample of 15 tax returns.000 $12.000 $7. Required 1. The standard deviation of this sample was £3.000 $16. 5. The taxes paid in $US according to these returns were as follows: $34. Calculate a 95% confidence interval for the average amount spent by all households in the population. Calculate a 90% confidence interval for the average amount spent by all households in the population.000 $6. Determine the 99% confidence intervals of the mean income of all the magazine readers of this ski magazine? 3. Explain the differences between the answers to Questions 1–3. Determine the 90% confidence intervals of the mean income of all the magazine readers of this ski magazine? 2. Taxes Situation To estimate the total annual revenues to be collected for the State of California in a certain year.000 $9.000 $72. Using for example the 95% confidence interval.000 Required 1. and 99% confidence intervals for the mean tax returns.000 $12.542.000 $23. 4. Calculate a 98% confidence interval for the average amount spent by all households in the population. 95%. Determine the 80%.000 $19. they determine that the average income is €39.000 $39.000 $15. 2. How would you explain the difference between the answers to Questions 1 and 2? 4. how would you present your analysis to your superior? . 2.254 Statistics for Business three weeks they take a sample and from a return of 758 subscribers.000 $0 $2. Required 1.

He samples at random 75 of the grape vines and finds that there is a mean of 15 grape bunches per vine. Required 1.200 grape vines. Would your answer change if you used a Student-t distribution rather than a normal distribution? 7. 2. This would mean that you would be 90% confident that the mean amount of imperfections lies within this range. This information is given in the table below. How do you explain the differences in these intervals and what does it say about confidence in decision-making? 6. The store will sell these at a marked-down price and it knows from past experience that it will have no problem selling these tiles as customers purchase these for tiling a basement or garage where slight imperfections are not critical. France. Imperfect means that the colour may not be uniform. with a sample standard deviation of 6. Determine a 90% confidence interval for the mean amount of imperfections on the floor tiles. Construct a 95% confidence limit for the bunch of grapes for the total of 5. Floor tiles Situation A hardware store purchases a truckload of white ceramic floor tiles from a supplier knowing that many of the tiles are imperfect.200 grape vines. a farmer is interested to estimate the yield from his 5. Vines Situation In the Beaujolais wine region north of Lyon. there may be surface hairline cracks. what is an estimate of the mean number of imperfections on the lot of white tiles? This would be a point estimate. 7 4 5 3 8 4 3 5 1 2 1 3 2 6 3 2 2 3 7 4 3 8 1 5 8 Required 1. or there may be air pockets on the surface finish. A store employee takes a random sample of 25 tiles from the storage area and counts the number of imperfections. 2. . How would you express the values determined in the previous question? 3. To the nearest whole number.Chapter 7: Estimating population characteristics 255 3. What is an estimate of the standard error of the number of imperfections on the tiles? 3.

324.920. What is your explanation of the difference between the limits obtained in Questions 3 and 4? 8.0 16.2 53.8 60.6 28. and the headquarters of the firm. stock holders equity.7 77.8 70.486. Europe Edition.9 36. 23 July 2007.733. 84.9 47. assets.829.5 32.0 16.466.520.0 24. number of employees.5 20. The following table gives a random sample of the revenues of 35 of those 500 firms for 2006. profits.017.4 Country United Kingdom Netherlands Switzerland United States United States United States Australia United States United Kingdom Switzerland Japan Spain United States Italy Sweden United States United States China South Korea United States Japan Japan United States Canada Switzerland France Japan The World’s Largest Corporations.746.0 25.934.0 24.1 25.0 107.2 Company Royal Mail Holdings Rabobank Swiss Reinsurance DuPont Liberty Mutual Insurance Coca-Cola Westpac Banking Northwestern Mutual Lloyds TSB Group UBS Sony Repsol YPF United Technologies San Paolo IMI Vattenfall Bank of America Kimberly-Clark State Grid SK Networks Archer Daniels Midland Bridgestone Matsushita Electric Industrial Johnson and Johnson Magna International Migros Bouygues Hitachi 2 Revenues ($millions) 16. Fortune.3 19.871. This would mean that you would be 99% confident that the mean amount of imperfections lies within this range. generated using the random function in Excel.693. World’s largest companies Situation Every year Fortune magazine publishes information on the world’s 500 largest companies.153.596.7 87.7 36.4 33. This information includes revenues.1 53.726.117.185.088. Determine a 99% confidence interval for the mean amount of imperfections on the floor tiles.5 16. .709.615.180.0 16. 5.0 22.170.768.256 Statistics for Business 4.904. 156(2).982.793.6 117.924.9 107. p.

3 19.9 17.1 59. Using the complete sample data. wrong billing of restaurant meals consumed. what is an estimate for the standard error? 8.397. This would mean that you would be 95% confident that the average revenue lies within this range. 9.360. This would mean that you would be 95% confident that the average revenue lies within this range. determine a 95% confidence interval for the mean value of revenues for the world’s 500 largest companies. Using the complete sample data.1 22. what is an estimate for the standard error? 3. determine a 99% confidence interval for the mean value of revenues for the world’s 500 largest companies.559. what is an estimate for the average value of revenues for the world’s 500 largest companies? 2.690. Using the first 15 pieces of data. These complaints included overcharging on items taken from the refrigerator in the room. Using the complete sample data. Using the first 15 pieces of data.5 81. Hotel accounts Situation A 125-room hotel noted that in the morning when clients check out there are often questions and complaints about the amount of the bill. Explain the difference between in the answers obtained in Questions 8 and 9. 5. determine a 99% confidence interval for the mean value of revenues for the world’s 500 largest companies. On a particular day the hotel . 4. Using the first 15 pieces of data. 11. and incorrect accounts of laundry items. Explain the difference between the answers obtained in Questions 3 and 4. 10.895. 6.6 25.Chapter 7: Estimating population characteristics 257 Company Mediceo Paltac Holdings Edeka Zentrale Unicredit Group Otto Group Cardinal Health BAE Systems TNT Tyson Foods Revenues ($millions) 18.119. This would mean that you would be 95% confident that the average revenue lies within this range. This would mean that you would be 95% confident that the average revenues lie within this range. Using the complete sample data. Explain the differences between the results in Questions 1 through 4 and those in Questions 6 through 9 and justify how you have arrived at your results. give an estimate for the average value of revenues for the world’s 500 largest companies? 7.9 20. determine a 95% confidence interval for the mean value of revenues for the world’s 500 largest companies. Using the first 15 pieces of data.524. 9.0 Country Japan Germany Italy Germany United States United Kingdom Netherlands United States Required 1.733.

How would you express the answers to Questions 1 and 2 to management? 4. 3. 44 88 69 80 61 34 76 55 75 41 66 68 72 57 32 48 34 88 36 62 42 89 60 95 91 36 73 74 50 65 Required 1. Automobile tyres Situation An automobile repair company has an inventory of 2.75 with a standard deviation of $0.53. This sample information in Euros is given in the table below. 5.8 errors on these sample accounts. .500 different sizes. Stuffed animals Situation A toy store in New York estimates that it has 270 stuffed animals in its store at the end of the week. Determine a 99% confidence interval for the cost price of the automobile tyres in inventory. 3. and 4? 10. 3. 4. and different makes of tyres. 5. An assistant takes a random sample of 19 of these stuffed animals and determines that the average retail price of these animals is $13.258 Statistics for Business is full and the night manager analyses a random sample of 19 accounts and finds that there is an average of 2. 2. It wishes to estimate the value of this inventory and so it takes a random sample of 30 tyres and records their cost price. Explain the differences between Questions 2 and 4? 6. What is an estimation of the cost price of the total amount of tyres in inventory? 2. Based on passed analysis the night manager believes that the population standard deviation is 0. How would you suggest a random sample of tyres should be taken from inventory? What other comments do you have? 11. Determine a 95% confidence interval for the cost price of the automobile tyres in inventory. Required 1.7. From this sample experiment. what is the correct value of the standard error? What are the confidence intervals for a 90% confidence level? What are the confidence intervals for a 95% confidence level? What are the confidence intervals for a 99% confidence level? Explain the differences between Questions 2.

005 of the population proportion at 90% confidence were required. is considering the introduction of a night shift.005 of the population proportion at 98% confidence were required. 4. What are your comments about the answer obtained in Question 4 and 5 and in general terms for this sampling process. What is a point estimate of the proportion of shampoo bottles that are defective in the production operation? 2. If an estimate of the proportion of defectives to within a margin of error of 0. At the end of the production operation the bottles pass through an optical quality control detector.Chapter 7: Estimating population characteristics 259 Required 1. how many bottles should pass through the optical detector? No information is available from past data. Explain the difference between Questions 3 and 4. and you wanted to be conservative in you analysis.000 employees. 3. how many bottles should pass through the optical detector? No information is available from past data. Any bottle that the detector finds defective is automatically ejected from the line. 17 were ejected. Night shift Situation The management of a large factory. Required 1.500 bottles that passed the optical detector. 13. 5. What is the correct value of the standard error of the sample? 2. If an estimate of the proportion of defectives to within a margin of error of 0. Shampoo bottles Situation A production operation produces plastic shampoo bottles for Procter and Gamble. where there are 10. . 12. 5. 4. Give a 95% confidence limit of the total retail value of all the stuffed animals in inventory. 6. Give a 99% confidence limit of the total retail value of all the stuffed animals in inventory. The human resource department took a random sample of 800 employees and found that there were 240 who were not in favour of a night shift. Obtain 90% confidence intervals for the proportion of defective bottles produced in production. In 1. Obtain 98% confidence intervals for the proportion of defective bottles produced in production. What is an estimate of the total value of the stuffed animals in the store? 3. and you wanted to be conservative in you analysis.

30 December 2005. England. 3 .. based in Watford.260 Statistics for Business Required 1. What is the proportion of employees who are in favour of a night shift? 2. 1. What are the 95% confidence limits for the population who are not in favour? 3. Ski trip Situation The Student Bureau of a certain business school plans to organize a ski trip in the French Alps. This transaction will create a worldwide empire of 2. What are the 98% confidence limits for the population who are not in favour? 5. Required 1. agreed in December 2005 to sell the international Hilton properties for £3. What is your explanation of the difference between Questions 3 and 5? 14.02 at a confidence level of 90%? No other sample information has been taken. H.800 hotels stretching from the WaldorfAstoria in New York to the Phuket Arcadia Resort in Thailand. What are the 98% confidence limits for the proportion who are in favour of a night shift? 6. 15. Hilton hotels Situation Hilton hotels. Obtain 90% confidence intervals for the proportion of students who will be coming skiing. Obtain 98% confidence intervals for the proportion of students who will be coming skiing.3 The objective of this new Timmons.3 billion to United States-based Hilton group. 4. What is an estimate of the proportion of students who say they will not be coming skiing? 2. What would be the conservative value of the sample size in order that the Student Bureau can estimate the true proportion of those coming skiing within plus or minus 0. 6. p. International Herald Tribune. What would be the conservative value of the sample size in order that the Student Bureau can estimate the true proportion of those coming skiing within plus or minus 0. 3. The bureau selects a random sample of 40 students and of these 24 say they will be coming skiing.000 students in the school. What are the 95% confidence limits for the proportion who are in favour of a night shift? 4. How would you explain the different between the answers to Questions 2 and 3? 5.02 at a confidence level of 98%? No other sample information has been taken. “Hilton sets the stage for global expansion”. There are 5.

the food processor is to be sold in a total of 100 stores in the six countries where the test market had been carried out. and Barcelona. The company made a test market. and from day to day. a member of the finance department takes a random sample of 49 hotels worldwide and finds that in a 3-month test period. during the first 3 months that this product was on sale. In order to test whether the objectives are able to be met. For the first year of commercialization after the “go” decision. What is an estimate of the proportion or percentage of the population of hotels that meet the objectives of the chain? 2. Required 1. Bergen. The weekly test market sales for these outlets are given in the table below. or a yield rate. What is a 98% confidence interval for the proportion of hotels who meet the objectives of the chain? 4. for which it has not yet decided to go into full commercialization. is a new computerized food processor. One of its new products. Spain. Norway. Limoges. Management wanted to use a confidence level of 90% in its analysis. Six stores were chosen for this study in the European cities of Milan. United Kingdom. because their Accounting Department had indicated that at least 130. Birmingham. Germany. Case: Oak manufacturing Situation Oak manufacturing company produces kitchen appliances.000 of this food processor need to be sold in the first year of commercialization to break-even. What are your comments about this sample experiment that might explain inconsistencies? 16. They reasonably assumed that daily sales were independent from country to country. 7. of all the hotels at least 90%. What would be the conservative value of the sample size that should be taken in order that the hotel chain can estimate the true proportion of those meeting the objectives is within plus or minus 10% of the true proportion at a confidence level of 98%? No earlier sample information is available. Hamburg.Chapter 7: Estimating population characteristics 261 chain is to have an average occupancy. How would you explain the difference between the answers to Questions 2 and 3? 5. Oak had developed this survey. Italy. 6. store to store. France. . What would be the conservative value of the sample size that should be taken in order that the hotel chain can estimate the true proportion of those meeting the objectives is within plus or minus 10% of the true proportion at a confidence level of 90%? No earlier sample information is available. 32 of these had an occupancy rate of at least 90%. What is a 90% confidence interval for the proportion of hotels who meet the objectives of the chain? 3. which it sells on the European market.

Germany 29 29 13 22 23 20 29 17 22 26 19 21 47 31 33 42 32 13 19 23 20 20 17 34 Limoges.262 Statistics for Business Milan. Italy 3 8 20 8 17 11 12 3 6 13 12 13 15 0 15 5 2 17 19 18 17 12 17 6 Hamburg. France 15 16 32 31 32 15 16 46 27 20 28 2 28 29 36 33 18 33 28 27 34 16 30 32 Birmingham. Spain 21 0 5 14 16 9 13 11 3 16 4 1 15 18 6 18 14 21 14 20 19 9 12 1 Required Based on this information what would be your recommendations to the management of Oak manufacturing? . United Kingdom 34 22 31 28 23 20 26 39 24 35 37 20 27 30 34 25 21 26 16 31 23 25 12 22 Bergen. Norway 25 19 25 35 25 20 34 29 24 33 36 39 38 12 33 26 35 30 28 34 20 29 20 36 Barcelona.

simply by intuition. . that radiation levels in the area are abnormally high? Alternatively. Three people in the area died of leukaemia. Does the death of three people make us assume that the government is wrong with its information and that we make the assumption. The local people immediately put the blame on the radioactive fallout.Hypothesis testing of a single population 8 You need to be objective The government in a certain country says that radiation levels in the area surrounding a nuclear power plant are well below levels considered harmful. or hypothesis. For this situation an appropriate action would be to take representative samples of the incidence of leukaemia cases over a reasonable time period and use these to test the hypothesis. do we accept that the deaths from leukaemia are random and are not related to the nuclear power facility? You should not accept. You need to be objective in decision-making. This is the purpose of this chapter (and the following chapter) to find out how to use hypothesis testing to determine whether a claim is valid. a hypothesis about a population parameter – in this case the radiation levels in the surrounding area of the nuclear power plant. There are many instances when published claims are not backed up by solid statistical evidence. or reject.

000 attendees. A contractor says that it will take 9 months to construct a house for a client. The completion time is not 9 months however it is not significantly different from the estimated time construction period of 9 months. The topics of these themes are as follows: ✔ ✔ ✔ ✔ ✔ Concept of hypothesis testing • Significance level • Null and alternative hypothesis Hypothesis testing for the mean value • A two-tail test • One-tail. right-hand test • One-tail. A financial advisor estimates that a client will make $15. about situations. hypothesis testing is an extension of the use of sampling presented in Chapter 6. or wrong. Thus like estimating. we need to decide what we consider is the significance level or the level of importance in our evaluation. The house is Thus in hypothesis testing. This significance level is giving a ceiling level usually in terms of . how to test for the mean and proportion and be aware of the risks in testing.000 on a certain investment. However. we are either right. if the client made only $8.000 people at an open air rock concert. Thus our hypothesis may be acceptable. or hypotheses.000 and has a justified reason to say that he was given bad advice.000 is significantly different from 20. or population parameter based simply on an assumption or intuition with no concrete backup information or analysis. However.900. Ticket receipts indicate there are 42.900 is not $15.000. outcome. The local authorities estimate that there are 20. The number $14.500 he would probably say that this is significantly different from the estimated $15. Hypothesis testing is to take sample data and make on objective decision based on the results of the test within an appropriate significance level.000 and the client really does not have a strong reason to complain.264 Statistics for Business Learning objectives After you have studied this chapter you will understand the concept of hypothesis testing.000 but it is not significantly different from $15. left-hand test • Acceptance or rejection • Test statistics • Application when the standard deviation of the population is known: Filling machine • Application when the standard deviation of the population is unknown: Taxes Hypothesis testing for proportions • Testing for proportions from large samples • Application of hypothesis testing for proportions: Seaworthiness of ships The probability value in testing hypothesis • p-value of testing hypothesis • Application of the p-value approach: Filling machine • Application of the p-value approach: Taxes Application of the p-value approach: Seaworthiness of ships • Interpretation of the p-value Risks in hypothesis testing • Errors in hypothesis testing • Cost of making an error • Power of a test Concept of Hypothesis Testing A hypothesis is a judgment about a situation. Consider the following: ● finished in 9 months and 1 week. This number of 42. ● ● Significance level When we make quantitative judgments. if we are wrong we may not be far from the real figure or that is our judgment is not significantly different. The client makes $14.

Chapter 8: Hypothesis testing of a single population percentages such as 1%, 5%, 10%, etc. To a certain extent this is the subjective part of hypothesis testing since one person might have a different criterion than another individual on what is considered significant. However in accepting, or rejecting a hypothesis in decision-making, we have to agree on the level of significance. This significance value, which is denoted as alpha, α, then gives us the critical value for testing.

265

**Hypothesis Testing for the Mean Value
**

In hypothesis testing for the mean, an assumption is made about the mean or average value of the population. Then we take a sample from this population, determine the sample mean value, and measure the difference between this sample mean and the hypothesized population value. If the difference between the sample mean and the hypothesized population mean is small, then the higher is the probability that our hypothesized population mean value is correct. If the difference is large then the smaller is the probability that our hypothesized value is correct.

**Null and alternative hypothesis
**

In hypothesis testing there are two defining statements premised on the binomial concept. One is the null hypothesis, which is that value considered correct within the given level of significance. The other is the alternative hypothesis, which is that the hypothesized value is not correct at the given level of significance. The alternative hypothesis as a value is also known as the research hypothesis since it is a value that has been obtained from a sampling experiment. For example, the hypothesis is that the average age of the population in a certain country is 35. This value is the null hypothesis. The alternative to the null hypothesis is that the average age of the population is not 35 but is some other value. In hypothesis testing there are three possibilities. The first is that there is evidence that the value is significantly different from the hypothesized value. The second is that there is evidence that the value is significantly greater than the hypothesized value. The third is that there is evidence that the value is significantly less than the hypothesized value. Note, that in these sentences we say there is evidence because as always in statistics there is no guarantee of the result but we are basing our analysis of the population based only on sampling and of course our sample experiment may not yield the correct result. These three possibilities lead to using a two-tail hypothesis test, a right-tail hypothesis test, and a left-tail hypothesis test as explained in the next section.

**A two-tail test
**

A two-tail test is used when we are testing to see if a value is significantly different from our hypothesized value. For example in the above population situation, the null hypothesis is that the average age of the population is 35 years and this is written as follows: Null hypothesis: H0: μx 35 8(i)

In the two-tail test we are asking, is there evidence of a difference. In this case the alternative to the null hypothesis is that the average age is not 35 years. This is written as, Alternative hypothesis: H1: μx 35 8(ii)

When we ask the question is there evidence of a difference, this means that the alternative value can be significantly lower or higher than the hypothesized value. For example, if we took a sample from our population and the average age of the sample was 36.2 years we might say that the average age of the population is not significantly different from 35. In this case we would accept the null hypothesis as being correct. However, if in our sample the average age was

266

Statistics for Business 52.7 years then we may conclude that the average age of the population is significantly different from 35 years since it is much higher. Alternatively, if in our sample the average age was 21.2 years then we may also conclude that the average age of the population is significantly different from 35 years since it is much lower. In both of these cases we would reject the null hypothesis and accept the alternative hypothesis. Since this is a binomial concept, when we reject the null hypothesis we are accepting the alternative hypothesis. Conceptually the two-tailed test is illustrated in Figure 8.1. Here we say that there is a 10% level of significance and in this case for a two-tail test there is 5% in each tail. than our hypothesized value. For example in the above population situation, the null hypothesis is that the average age of the population is equal to or less than 35 years and this is written as follows: Null hypothesis: H0: μx 35 8(iii)

The alternative hypothesis is that the average age is greater than 35 years and this is written as, Alternative hypothesis: H1: μx 35 8(iv)

**One-tail, right-hand test
**

A one-tail, right-hand test is used to test if there is evidence that the value is significantly greater

Thus, if we took a sample from our population and the average age of the sample was say 36.2 years we would probably say that the average age of the population is not significantly greater than 35 years and we would accept the null hypothesis. Alternatively, if in our sample the average age was 21.2 years then although this is significantly less than 35, it is not greater than 35. Again we would accept the null hypothesis. However, if in

Figure 8.1 Two-tailed hypothesis test.

Question being asked, “Is there evidence of a difference?”

**“Is there evidence that the average age is not 35?” H0 : x 35 H1 :
**

x

35

If sample means falls in this region we accept the null hypothesis

At a 10% significance level, there is 5% of the area in each tail

35

Reject null hypothesis if sample mean falls in either of these regions

Chapter 8: Hypothesis testing of a single population our sample the average age was 52.7 years then we may conclude that the average age of the population is significantly greater than 35 years and we would reject the null hypothesis and accept the alternative hypothesis. Note that for this situation we are not concerned with values that are significantly less than the hypothesized value but only those that are significantly greater. Again, since this is a binomial concept, when we reject the null hypothesis we accept the alternative hypothesis. Conceptually the one-tail, right-hand test is illustrated in Figure 8.2. Again we say that there is a 10% level of significance, but in this case for a one-tail test, all the 10% area is in the right-hand tail. us consider the above population situation. The null hypothesis, H0: μx, is that the average age of the population is equal to or more than 35 years and this is written as follows: H0: μx 35 8(v)

267

The alternative hypothesis, H1: μx, is that the average age is less than 35 years. This is written, H1: μx 35 8(vi)

**One-tail, left-hand test
**

A one-tail, left-hand test is used to test if there is evidence that the value is significantly less than our hypothesized value. For example again let

Thus, if we took a sample from our population and the average age of the sample was say 36.2 years we would say that there is no evidence that the average age of the population is significantly less than 35 years and we would accept the null hypothesis. Or, if in our sample the average age was 52.7 years then although this is significantly greater than 35 it is not less than 35 and we would accept the null hypothesis. However, if in our sample the average age was 21.2 years then we may conclude that the average age of the

Figure 8.2 One-tailed hypothesis test (right hand).

Question being asked, “Is there evidence of something being greater?”

If sample means falls in this region we accept the null hypothesis

“Is there evidence that the average age is greater than 35?” H0 : x 35 H1 : x 35

At a 10% significance level, all the area is in the right tail

35

Reject null hypothesis if sample mean falls in this region

268

Statistics for Business population is significantly less than 35 years and we would reject the null hypothesis and accept the alternative hypothesis. Note that for this situation we are not concerned with values that are significantly greater than the hypothesized value but only those that are significantly less than the hypothesized value. Again, since this is a binomial concept, when we reject the null hypothesis we accept the alternative hypothesis. Conceptually the one-tail, left-hand test is illustrated in Figure 8.3. With the 10% level of significance shown means that for this one-tail test all the 10% area is in the left-hand tail. the hypothesized population mean. If we test at the 10% significance level this means that the null hypothesis would be rejected if the difference between the sample mean and the hypothesized population mean is so large than it, or a larger difference would occur, on average, 10 or fewer times in every 100 samples when the hypothesized population parameter is correct. Assuming the hypothesis is correct, then the significance level indicates the percentage of sample means that are outside certain limits. Even if a sample statistic does fall in the area of acceptance, this does not prove that the null hypothesis H0 is true but there simply is no statistical evidence to reject the null hypothesis. Acceptance or rejection is related to the values of the test statistic that are unlikely to occur if the null hypothesis is true. However, they are not so unlikely to occur if the null hypothesis is false.

Acceptance or rejection

The purpose of hypothesis testing is not to question the calculated value of the sample statistic, but to make an objective judgment regarding the difference between the sample mean and

Figure 8.3 One-tailed hypothesis test (left hand).

Question being asked, “Is there evidence of something being less than?” “Is there evidence that the average age is less than 35?” H0 : x 35 H1 : x 35

If sample means falls in this region we accept the null hypothesis

At a 10% significance level, all the area is in the left tail

35

Reject null hypothesis if sample mean falls in this region

Chapter 8: Hypothesis testing of a single population

269

Test statistics

We have two possible relationships to use that are analogous to those used in Chapter 7. If the population standard deviation is known, then using the central limit theorem for sampling, the test statistic, or the critical value is, x test statistics, z σx μH

●

●

● ●

0

n

8(vii)

●

– The numerator, x μH0, measures how far, the observed mean is from the hypothesized mean. ˆ σx is the estimate of the population standard deviation and is equal to the sample standard deviation, s. n is the sample size. ˆ σx n , the denominator in the equation, is the estimated standard error. t, is how many standard errors, the observed sample mean is from the hypothesized mean.

Where,

● ● ●

● ● ●

●

μH0 is the hypothesized population mean. – x is the sample mean. – The numerator, x μx, measures how far, the observed mean is from the hypothesized mean. σx is the population standard deviation. n is the sample size. σx n , the denominator in the equation, is the standard error. z, is how many standard errors, the observed sample mean is from the hypothesized mean.

The following applications illustrate the procedures for hypothesis testing.

**Application when the standard deviation of the population is known: Filling machine
**

A filling line of a brewery is for 0.50 litre cans where it is known that the standard deviation of the filling machine process is 0.05 litre. The quality control inspector performs an analysis on the line to test whether the process is operating according to specifications. If the volume of liquid in the cans is higher than the specification limits then this costs the firm too much money. If the volume is lower than the specifications then this can cause a problem with the external inspectors. A sample of 25 cans is taken and the average of the sample volume is 0.5189 litre. 1. At a significance level, α, of 5% is there evidence that the volume of beer in the cans from this bottling line is different than the target volume of 0.50 litre? Here we are asking the question if there is there evidence of a difference so this means it is a two-tail test. The null and alternative hypotheses are written as follows: Null hypothesis: H0: μx 0.50 litre. 0.50 litre.

If the population standard deviation is unknown then the only standard deviation we can determine is the sample standard deviation, s. This value of s can be considered an estimate of the population standard deviation sometimes ˆ written as σx. If the sample size is less than 30 then we use the Student-t distribution, presented in Chapter 7, with (n 1) degrees of freedom making the assumption that the population from which this sample is drawn is normally distributed. In this case, the test statistic can be calculated by, x t Where,

●

μH

0

ˆ σx

8(viii)

n

●

μH0 is again the hypothesized population mean. – x is the sample mean.

Alternative hypothesis: H1: μx

270

Statistics for Business And, since we know the population standard deviation we can use equation 8(vii) where, ● μ H0 is the hypothesized population mean, or 0.50 litre. – ● x is the sample mean, or 0.5189 litre. – ● The numerator, x μH0, is 0.5189 0.5000 0.0189 litre. ● σ is the population standard deviation, or x 0.05 litre. ● n is the sample size, or 25. ● n 5. Thus, the standard error of the sample is 0.05/5 0.01. The test statistic from equation 8(vii) is, x z σx μH

0

2. At a significance level, α, of 5% is there evidence that the volume of beer in the cans from this bottling line is greater than the target volume of 0.50 litre? Here we are asking the question if there is evidence of the value being greater than the target value and so this is a one-tail, right-hand test. The null and alternative hypotheses are as follows: Null hypothesis: H0: μx 0.50 litre. 0.50 litre.

Alternative hypothesis: H1: μx

n

0.0189 0.01

1.8900

At a significance level of 5% for the test of a difference there is 2.5% in each tail. Using [function NORMSINV] in Excel this gives a critical value of z of 1.96. Since the value of the test statistic or 1.89 is less than the critical value of 1.96, or alternatively within the boundaries of 1.96 then there is no statistical evidence that the volume of beer in the cans is significantly different than 0.50 litre. Thus we would accept the null hypothesis. These relationships are shown in Figure 8.4. Figure 8.4 Filling machine – Case 1.

Nothing has changed regarding the test statistic and it remains 1.8900 as calculated in Question 1. However for a one-tail test, at a significance level of 5% for the test there is 5% in the right tail. The area of the curve for the upper level is 100% 5.0% or 95.00%. Using [function NORMSINV] in Excel this gives a critical value of z of 1.64. Since now the value of the test statistic or 1.89 is greater than the critical value of 1.64 then there is evidence that the volume of beer in all of the cans is significantly greater than 0.50 litre. Conceptually this situation is shown on the normal distribution curve in Figure 8.5. Figure 8.5 Filling machine – Case 2.

2.5% of area

2.5% of area

5.0% of area

1.96 Critical value

1.89 Test statistic

1.96 Critical value

1.64

1.89

Critical value

Test statistic

Chapter 8: Hypothesis testing of a single population

271

**Application when the standard deviation of the population is unknown: Taxes
**

A certain state in the United States has made its budget on the bases that the average individual average tax payments for the year will be $30,000. The financial controller takes a random sample of annual tax returns and these amounts in United States dollars are as follows.

this can be taken as an estimate of the popuˆ lation standard deviation, σx. Estimate of the standard error is, ˆ σx n 17, 815.72 16 $4, 453.93

**From equation 8(vii) the sample statistic is, x t ˆ σx μH
**

0

n

8, 500 4, 453.93

1.8523

34,000 2,000 24,000 23,000

12,000 39,000 15,000 14,000

16,000 7,000 19,000 6,000

10,000 72,000 12,000 43,000

1. At a significance level, α, of 5% is there evidence that the average tax returns of the state will be different than the budget level of $30,000 in this year? The null and alternative hypotheses are as follows: Null hypothesis: H0: μx Alternative hypothesis: H1: μx $30,000. $30,000.

Since we have no information of the population standard deviation, and the sample size is less than 30, we use a Student-t distribution. Sample size, n, is 16. Degrees of freedom, (n 1) are 15.

Since the sample statistic, 1.8523, is not less than the test statistic of 2.1315, there is no reason to reject the null hypothesis and so we accept that there is no evidence that the average of all the tax receipts will be significantly different from $30,000. Note in this situation, as the test statistic is negative we are on the left side of the curve and so we only make an evaluation with the negative values of t. Another way of making the analysis, when we are looking to see if there is a difference, is to see whether the sample statistic of 1.8523 lies within the critical boundary values of t 2.1315. In this case it does. The concept is shown in Figure 8.6.

Figure 8.6 Taxes – Case 1.

Using [function TINV] from Excel the Student-t value is 2.1315 and these are the critical values. Note, that since this is a two-tail test there is 2.5% of the area in each of the tails and t has a plus or minus value. From Excel, using [function AVERAGE]. Mean value of this sample data, x, is $21,750.00.

– x –

2.5% of area

2.5% of area

μx

21,750.00 30,000.00 $8,250.00

2.1315 Critical value

1.8523 Test statistic

2.1315 Critical value

From [function STDEV] in Excel, the sample standard deviation, s, is $17,815.72 and

272

Statistics for Business there is reason to reject the null hypothesis and to accept the alternative hypothesis that there is evidence that the average value of all the tax receipts is significantly less than $30,000. Note that in this situation we are on the left side of the curve and so we are only interested in the negative value of t. This situation is conceptually shown on the Student-t distribution curve of Figure 8.7.

2. At a significance level, α, of 5% is there evidence that the tax returns of the state will be less than the budget level of $30,000 in this year? This is a left-hand, one-tail test and the null and alternative hypothesis are as follows: Null hypothesis: H0: μx Alternative hypothesis: H1: μx $30,000. $30,000.

Again, since we have no information of the population standard deviation, and the sample size is less than 30, we use a Student-t distribution. Sample size, n, is 16. Degrees of freedom, (n 1) is 15.

**Hypothesis Testing for Proportions
**

In hypothesis testing for the proportion we test the assumption about the value of the population proportion. In the same way for the mean, we take a sample from this population, determine the sample proportion, and measure the difference between this proportion and the hypothesized population value. If the difference between the sample proportion and the hypothesized population proportion is small, then the higher is the probability that our hypothesized population proportion value is correct. If the difference is large then the probability that our hypothesized value is correct is low.

Here we have a one-tail test and thus all of the value of α, or 5%, lies in one tail. However, the Excel function for the Student-t value is based on input for a two-tail test so in order to determine t we have to enter the area value of 10% (5% in one tail and 5% in the other tail.) Using [function TINV] gives a critical value of t 1.7531. The value of the sample statistic t remains unchanged at 1.8532 as calculated in Question 1. Since now the sample statistic, 1.8523 is less than the test statistic, 1.8523, then

Figure 8.7 Taxes – Case 2.

**Hypothesis testing for proportions from large samples
**

In Chapter 6, we developed the relationship from the binomial distribution between the popula– tion proportion, p, and the sample proportion p. On the assumption that we can use the normal distribution as our test reference then from equation 6(xii) we have the value of z as follows: z

1.8523 Test statistic 1.7531 Critical value

5.0% of area

p σp

p

p p(1

p p) n

6(xii)

In hypothesis testing for proportions we use an analogy as for the mean where p is now the

Chapter 8: Hypothesis testing of a single population hypothesized value of the proportion and may be written as pH 0. Thus, equation 6(xii) becomes, p z pH σp

0

273

**The standard error of the proportion, or the denominator in equation 8(ix). σp p H (1
**

0

p p H (1

0

pH

0

pH ) n

0

8(ix)

pH )

0

n 0.16 150 0.0327.

0.80 * 0.20 150

The application of the hypothesis testing for proportions is illustrated below.

**Application of hypothesis testing for proportions: Seaworthiness of ships
**

On a worldwide basis, governments say that 0.80, or 80%, of merchant ships are seaworthy. Greenpeace, the environmental group, takes a random sample of 150 ships and the analysis indicates that from this sample, 111 ships prove to be seaworthy. 1. At a 5% significance level, is there evidence to suggest that the seaworthiness of ships is different than the hypothesized 80% value? Since we are asking the question is there a difference then this is a two-tail test with 2.5% of the area in the left tail, and 2.5% in the right tail or 5% divided by 2. From Excel [function NORMSINV] the value of z, or the critical value when the tail area is 2.5% is 1.9600. The hypothesis test is written as follows: H0: p 0.80. The proportion of ships that are seaworthy is equal to 0.80. H1: p 0.80. The proportion of ships that are not seaworthy is different from 0.80. Sample size n is 150.

– Sample proportion p that is seaworthy is 111/150 0.74 or 74%.

– 0.06. p pH0 0.74 0.80 Thus the sample test statistic from equation 6(xii) is,

z

p σp

p

0.06 0.0327

1.8349

Since the test statistic of 1.8349 is not less than 1.9600 then we accept the null hypothesis and say that at a 5% significance level there is no evidence of a significance difference between the 80% of seaworthy ships postulated. Conceptually this situation is shown in Figure 8.8. 2. At a 5% significance level, is there evidence to suggest that the seaworthiness of ships is less than the 80% indicated? This now becomes a one-tail, left-hand test where we are asking is there evidence that

Figure 8.8 Seaworthiness of ships – Case 1.

2.5% of area

2.5% of area

**From the sample, the number of ships that are not seaworthy is 39 (150 111).
**

– Sample proportion q seaworthy is 39/150 – (1 p ) that is not 0.26 or 26%.

1.96 Critical value

1.8349 Test statistic

1.96 Critical value

274

Statistics for Business which then translates into a critical value of z or t, and then test to see whether the sample statistic lies within the boundaries of the critical value. If the test statistic falls within the boundaries then we accept the null hypothesis. If the test statistic falls outside, then we reject the null hypothesis and accept the alternative hypothesis. Thus we have created a binomial “yes” or “no” situation by examining whether there is sufficient statistical evidence to accept or reject the null hypothesis.

Figure 8.9 Seaworthiness of ships – Case 2.

5.0% of area

1.8349

1.6449

**p-value of testing hypothesis
**

An alternative approach to hypothesis testing is to ask, what is the minimum probable level that we will tolerate in order to accept the null hypothesis of the mean or the proportion? This level is called the p-value or the observed level of significance from the sample data. It answers the question that, “If H0 is true, what is the probability of – – obtaining a value of x, (or p, in the case of proportions) this far or more from H0. If the p-value, as determined from the sample, is to α the null hypothesis is accepted. Alternatively, if the p-value is less than α then the null hypothesis is rejected and the alternative hypothesis is accepted. The use of the p-value approach is illustrated by re-examining the previous applications, Filling machine, Taxes, and Seaworthiness of ships.

Test statistic

Critical value

the proportion is less than 80% The hypothesis test is thus written as, H0: p 0.80. The proportion of ships is not less than 0.80. H1: p 0.80. The proportion of ships is less than 0.80. In this situation the value of the sample statistics remains unchanged at 1.8349, but the critical value of z is different. From Excel [function NORMSINV] the value of z, or the critical value when the tail area is 5% is z –1.6449. Now we reject the null hypothesis because the value of the test statistic, 1.8349 is less than the critical value of 1.6449. Thus our conclusion is that there is evidence that the proportion of ships that are not seaworthy is significantly less than 0.80 or 80%. Conceptually this situation is shown on the distribution in Figure 8.9.

**Application of the p-value approach: Filling machine
**

1. At a significance level, α, of 5% is there evidence that the volume of beer in the cans from this bottling line is different than the target volume of 0.50 litre? As before, a sample of 25 cans is taken and the average of the sample volume is 0.5189 litre. The test statistic, x z σx μH

0

**The Probability Value in Testing Hypothesis
**

Up to this point our method of analysis has been to select a significance level for the hypothesis,

n

0.0189 0.01

1.8900.

2.8523 and from Excel [function TDIST] this sample statistic. Since 4. 275 2. As 2. indicates a probability of 8. for a two-tail test. of 5% is there evidence that the average tax returns of the state will be different than the budget level of $30.00% we reject the null hypothesis and accept the alternative hypothesis and conclude that there is evidence that the volume of beer in the cans is greater than 0.000 in this year? The sample statistic gives a Student-t value which is equal to 1. Since now 3.94% 5. α. Thus the area in the right-hand tail is 100% 97.38% 5.31% 2. This is the same conclusion as before. the area in each of the tail set by the significance level is 2.50 litre. Since we have a two-tail test.38%.31% 5. of 5% is there evidence that the tax returns of the state will be less than the budget level of $30. for a one-tail test.06% 2. At a significance level.00% we reject the null hypothesis and conclude that there is evidence to indicate that the average tax receipts are significantly less than $30.94%.50% we accept the null hypothesis and conclude that there is no evidence to indicate that the seaworthiness of ships is different from the hypothesized value of 80%.06 0. left-hand test then there is 5% in the tail.19%. indicates a probability of 4.000.94%. This is the same conclusion as before. At a significance level.8523. Since this is a two-tail test the area in the left tail is also 2. Application of the p-value approach: Seaworthiness of ships 1.31%.94%. As this is a two-tail test then there is 2. From Excel [function TDIST] this sample statistic of 1.000.Chapter 8: Hypothesis testing of a single population From Excel [function NORMSDIST] for a value of z. α.50% then we accept the null hypothesis and conclude that the volume of beer in the cans is not different from 0. This is the same conclusion as before.5% in each tail.19% 5. At a significance level. of 5% is there evidence that the volume of beer in the cans from this bottling line is greater than the target volume of 0. From Excel [function NORMSDIST] this sample statistic.8900 gives an area in the right-hand tail of 2.00% we reject the null hypothesis and conclude that there is evidence to indicate that the seaworthiness of ships is less than the hypothesized value of 80%.94% 2.06%. 2. for a two-tail test.50%. At a 5% significance level. Since 3. α.8523. is there evidence to suggest that the seaworthiness of ships is less than the 80% indicated? As this is a one-tail. Since 8.00% we accept the null hypothesis and conclude that there is no evidence to indicate that the average tax receipts are significantly less than $30.000 in this year? The sample statistic gives a t-value which is equal to 1.0327 1.50 litre? The value of the test statistic of 1. is there evidence to suggest that the seaworthiness of ships is different than the 80% indicated? z p σp p 0. . right-hand test when the significance level is 5%. At a 5% significance level. This is the same conclusion as before.8349 Application of the p-value approach: Taxes 1.50 litre.8900 area of the curve from the left is 97. Since 2. the 1. We now have a one-tail. indicates a probability of 3.

19 11. Table 8. Thus we have observed an unlikely event or an event so unlikely that we should doubt our assumptions about the population mean in the first place. In this case. α used for hypothesis testing then the higher is the percentage of the distribution in the tails. the smaller is the p-value. Note. with a high significance level.28 0. that in order to calculate the value of the test statistic we assumed that the null hypothesis is true and thus we have reason to reject the null hypothesis and accept the alternative. At the 1% significance level.5000 litre tend to indicate that the alternative hypothesis is true or the smaller the p-value. You cannot make a probability assumption about the population parameter 0.1 Sample mean and a corresponding z and p-value.5189 litre.10.00 34.5 litre. Consider Table 8. the sample size obtained is 0.51 5. and the corresponding p-value for the filling machine situation. the risk of rejecting a null hypothesis when it is in fact true is greater at a 50% significance level. or is not true.5160 0.2000 1. then as α increases there is a greater probability of rejecting the null hypothesis when in fact it is true. Looking at it another way.5200 0. Sample mean x-bar 0.5000 litre.46 21. The p-value provides useful information as it measures the amount of statistical evidence that supports the alternative hypothesis. than at a 1% significance level. that is a high value of α. when it is false is greater than at a significance level of 50%. or moves further away from the hypothesized population mean of – 0.0000 0.48 2. Interpretation of the p-value In hypothesis testing we are making inferences about a population based only on sampling.4000 0.5120 0. Alternatively.5189 litre from a population whose mean is 0.5040 0.276 Statistics for Business far above 0.8000 1. Errors in hypothesis testing The higher the value of the significance level.5000 litre as this is not a random variable. These errors in hypothesis testing are referred to as Type I or Type II errors. when α is high. the greater is the probability of rejecting a null hypothesis.94% or quite small.5000 0.6000 2. the value of the test statistic. the probability of accepting the hypothesis.82 . Since the null hypothesis is true.5240 Test statistic z 0.0000 2. it is unlikely we would accept a null hypothesis when it is in fact not true. This relationship is illustrated in the normal distributions of Figure 8.5000 litre is 2. which gives values of the sample mean. Values of x Risks in Hypothesis Testing In hypothesis testing there are risks when you sample and then make an assumption about the population parameter. This is to be expected since statistical analysis gives no guarantee of the result but you hope that the risk of making a wrong decision is low.5080 0. The probability of obtaining a sample mean of 0. In the case of the filling machine for example where we are asking is there evidence that the volume of beer in the can is greater than 0. Remember that the p-value is not to be interpreted by saying that it is the probability that the null hypothesis is true.1. As the sample mean gets larger. The sampling distribution permits us to make probability statements about a sample statistic on the basis of the knowledge of the population parameter.4000 p-value % 50. the more the statistical evidence there is to support the alternative hypothesis.

Cost of making an error Consider that a pharmaceutical firm makes a certain drug. However. A Type II error is accepting a null hypothesis when it is not true. 99% The higher the significance level for testing the hypothesis. In reality the batch was good and could have been accepted. However. In this case the produced pharmaceutical product is accepted and commercialized but it does not conform to quality specifications. we will often reject a null hypothesis when it is in fact true. it is unlikely we would accept a null hypothesis when it is false. suppose the quality inspector makes a Type II error.5% of area x Significance level of 1% 90% 5% of area 5% of area 50% x Significance level of 10% 25% of area Significance level is the total area in the tail(s) Significance level of 50% x 25% of area A Type I error occurs if the null hypothesis is rejected when in fact it is true. The level of significance to use depends on the cost of the error as illustrated as follows. Alternatively.Chapter 8: Hypothesis testing of a single population 277 Figure 8. He makes a Type I error in his analysis. As a result. This may mean .5% of area 0. all the production quantity in the reaction vessel is dumped and the firm starts the production all over again. When the acceptance region is small. or α is large. A quality inspector tests a sample of the product from the reaction vessel where the drug is being made. The probability of a Type I error is called α where α is also the level of significance. In this case. the greater is the probability of rejecting a null hypothesis when it is true. The probability of a Type II error is called β. or accepts a null hypothesis when it is in fact false. the firm incurs all the additional costs of repeating the production operation. we would rarely accept a null hypothesis when it is not true. 0. at a risk of being this sure.10 Selecting a significance level. That is he rejects a null hypothesis when it is true or concludes from the sample that the drug does not conform to quality specifications when in fact it really does.

Alternatively. α. when it is in fact true. Thus. is that a person if charged with murder is considered innocent of the crime and the court has to prove guilt. does not equal the hypothesized population value but instead equals some other value. the measure of how well the test is doing. β. utmost care must be made to ensure that the sample taken is a true representation of the population. the cost of the error is relatively low and manufacturer is more likely to prefer a Type II error even though the marketing image may be damaged. the jury would prefer to commit a Type II error or accepting a null hypothesis that the person is innocent. when it is in fact not true. Suppose in another situation. Again. In this situation. to be as large as possible. β of accepting the null hypothesis when it is false. Rejecting a null hypothesis when it is false is exactly what a good hypothesis test ought to do.10. the manufacturer will set low levels for α such as 10% as illustrated in Figure 8. The value of (1 β). we would like (1 β) the probability of rejecting a null hypothesis when it is false. or the probability of making a Type II error β to be small. An inspector takes a sample of this component from the production line and measures the appropriate properties. He makes a Type I error in the analysis.278 Statistics for Business that users of the drug could become sick. Thus. In this case the person would be found guilty and risk the death penalty (at least in the United States) for a crime that they did not commit. a low value of (1 β) approaching zero means that the test is not working well and the test is not rejecting the null hypothesis when it is false. In this case. The cost of an error in some situations might be infinite and irreparable. and thus let the guilty person go free. In this case to correct this conclusion would involve an expensive disassembly operation of many components on the shop floor that have already been produced. hypothesis tests are not perfect and when a null hypothesis is false. In this latter case. rather than take the risk of poisoning the users. This implies having a high value of α such as 50% as illustrated in Figure 8. Alternatively. For each possible value for which the alternative hypothesis is true. Consider for example a murder trial. a pharmaceutical firm would prefer to make a Type I error. this might involve a less expensive warranty repairs by the dealers when the washing machines are commercialized. We would like this value of β to be as small as possible. in order to avoid errors in hypothesis testing. However. The “cost” of this error would be very high. The alternative would be to accept a Type I error or rejecting the null hypothesis that the person is innocent.2 summarizes the four possibilities that can occur in hypothesis testing and what type of errors might be incurred. Table 8. In this case. if a null hypothesis is false then we would like the hypothesis test to reject this conclusion every time. A high value of (1 β) approaching 1. or at worse die. When the null hypothesis is false this implies that the true population value. or accepting a null hypothesis when it is in fact false. in hypothesis testing we would like the probability of making a Type I error. is called the power of the test. He rejects the null hypothesis that the component conforms to specifications.0 means that the test is working well. is made or that is accepting a null hypothesis when it is false. there is a different probability. a test may not reject it and consequently a Type II error. Under Anglo-Saxon law the null hypothesis. when in fact the null hypothesis is true. Power of a test In any analytical work we would like the probability of making an error to be small.10. as in all statistical work. On the other hand if the inspector had made a Type II error. a manufacturing firm is making a mechanical component that is used in the assembly of washing machines. . or the null hypothesis is false. or destroying the production lot.

right-hand test. The significance level establishes a critical value. α is made In reality for the population – null hypothesis. which is the barrier beyond which decisions will change. In all of these tests the first step is to determine a sample test value. hypothesis testing for proportions. α which is the level of importance in the difference between values before we accept an alternative hypothesis. β is made • • • • Test statistic falls in the region α Decision is correct No error is made Power of test is (1 β) Decision you make Null hypothesis. which is the announced value. H0 is rejected This chapter has dealt with hypothesis testing or making objective decisions based on sample data. H0 is false – what your test indicates • Test statistic falls in the region (1 α) • Decision is incorrect • A Type II error. Chapter Summary Concept of hypothesis testing Hypothesis testing is to sample from a population and decide whether there is sufficient evidence to conclude that the hypothesis appears correct. Another is to test to see if there is evidence that a value is significantly greater than the hypothesized amount. Hypothesis testing for the mean value In hypothesis testing for the mean we are trying to establish if there is statistical evidence to accept a hypothesized average value.2 Sample mean and a corresponding z and p-value. The first is to establish if there is a significant difference from the hypothesize mean. This gives rise to a one-tail. We can have three frames of references. The third is a left-hand test that decides if a value is significantly less than a hypothesized value. Then there is the alternative hypothesis. We . There is the null hypothesis denoted by H0. H0 is accepted Null hypothesis. H1 which is the other situation we accept should we reject the null hypothesis. In testing we need to decide on a significance level. the probability value in testing hypothesis. In reality for the population – null hypothesis. The concept of hypothesis testing is binomial. then presented hypothesis testing for the mean. and finally summarized the risks in hypothesis testing. When we reject the null hypothesis we automatically accept the alternative hypothesis.Chapter 8: Hypothesis testing of a single population 279 Table 8. H0 is true – what your test indicates • Test statistic falls in the region (1 α) • Decision is correct • No error is made • Test statistic falls in the region α • Decision is incorrect • A Type I error. depending on our knowledge of the population. or t. either z. This gives a two-tail test. The chapter opened with describing the concept of hypothesis testing.

However if we have a high value of α. This outcome is called a Type I error. Hypothesis testing for proportions The hypothesis test for proportions is similar to the test for the mean value but here we are trying to see if there is sufficient statistical evidence to accept or reject a hypothesized population proportion. which is a direct consequence of our significance level.280 Statistics for Business then compare this test value to our critical value. If our test value is within the limits of the critical value. The value of (1 β). a one-tail. is a measure of how well the test is doing and is called the power of the test. If we select a high level of significance. If the test statistic is within our boundary limits we accept the null hypothesis. for hypothesis testing is an alternative approach to the critical value method for testing assumptions about the population mean or the population proportion. We than determine the value of our sample statistic and compare this to the critical value determined from our significance level. right-hand test. When the p-value is less than α. we reject the null hypothesis and accept the alternative hypothesis. left-hand test. The closer the value of (1 β) is to unity implies that the test is working quite well. . the risk of accepting a null hypothesis when it is false is low. we accept the null hypothesis. which means a large value of α the greater is the risk of rejecting a null hypothesis when it is in fact true. or p-value. The p-value is the minimum probability that we will tolerate before we reject the null hypothesis. Otherwise we reject the null hypothesis and accept the alternative hypothesis. otherwise we reject it. We establish a significance level and this sets our critical value of z. or a one-tail. As for the mean. we can have a two-tail test. A Type II error called β occurs if we accept a null hypothesis when it is in fact false. The probability value in testing hypothesis The probability. Risks in hypothesis testing As in all statistical methods there are risks when hypothesis testing is carried out. The criterion is that we can assume the normal distribution in our analytical procedure. our level of significance.

the sugar producer. The average life of the sample of these bulbs is 2. is there evidence that the net weight of the bags of sugar is different than 1 kg? 5. Required 1. What are the confidence limits corresponding to a significance level of 5%. . Sugar Situation One of the processing plants of Béghin Say. is there evidence to suggest that the life of the light bulbs is different than 2. If you use the p-value for testing are you able to verify your conclusions in Question 1? Explain your reasoning. The quality control inspector takes a random sample of 22 bags of sugar and finds that the weight of this sample is 1.500 hours? 2. It is known from experience that the standard deviation of the filling operation is 15 g. with a standard deviation of 40 hours. Why is it necessary to use a difference test? Why should this processing plant be concerned with the results? 2.485 hours.500 hours. 6. 3. using the critical value method.500 hours. If you use the p-value for testing are you able to verify your conclusions in Question 4? Explain your reasoning.Chapter 8: Hypothesis testing of a single population 281 EXERCISE PROBLEMS 1. Before the firm finalizes the purchase it takes a random sample of 20 neon bulbs and tests them until they burn out. (Note.006 g. At a significance level of 10% for analysis. has problems controlling the filling operation for its 1 kg net weight bags of white sugar. What are the confidence limits corresponding to a significance level of 10%. At a significance level of 5% for analysis. The subsidiary claims that the life of the light bulbs is 2. is there evidence that the net weight of the bags of sugar is different than 1 kg? 2. If you use the p-value for testing are you able to verify your conclusions in Question 1? Explain your reasoning. How do these values corroborate your conclusions for Questions 4 and 5? 7. How do these values corroborate your conclusions for Questions 1 and 2? 4. the firm has a special simulator that tests the bulb and in practice it does not require that the bulbs have to be tested for 2.) Required 1. at a 5% significance level. Neon lights Situation A firm plans to purchase a large quantity of neon light bulbs from a subsidiary of GE for a new distribution centre that it is building. using the critical value method. Using the critical value approach.

If the results from the Questions 3 and 4 what options are open to the purchasing firm? 3. At a 5% significance level. Using the critical value approach.7197 0.6900 0.7200 0. If you use the p-value for testing are you able to verify your conclusions in Question 3? Explain your reasoning.7788 0.7600 Required 1.6200 0. At a 10% significance level. How do these values corroborate your conclusions for Questions 1 and 2? 4.7800 0. If you applied the appropriate one-tail test what conclusions would you draw? Explain your logic. is there evidence to suggest that the diameter of the lead is different from the supplier’s claim? 2. Graphite lead Situation A company is selecting a new supplier for graphite leads which it uses for its Pentel-type pencils.7800 0.282 Statistics for Business 3.7700 0. Explain your reasoning. using the p-value concept. Industrial pumps Situation Pumpet Corporation manufactures electric motors for many different types of industrial pumps.7100 0.6600 0. At a 10% significance level.6600 0. 6.7100 0. What are the confidence limits corresponding to a significance level of 10%. Explain your reasoning. using the p-value concept.7030 0.05 mm. How do these values corroborate your conclusions for Questions 4 and 5? 7. 0.7500 0. The supplier claims that the average diameter of its leads is 0. The company wishes to verify this claim because if the lead is significantly too thin it will break.6500 0. verify your answer obtained in Question 1. The mean of the sample data is an indicator whether the lead is too thin or too thick. using the critical value concept.7200 0. At a 5% significance level.7000 0. 5.7900 0.500 hours? 4. 4. verify your answer obtained in Question 1. using the critical value concept. One of the parts is the drive shaft that attaches to the pump. What are the confidence limits corresponding to a significance level of 5%. is there evidence to suggest that the diameter of the lead is different from the supplier’s claim? 5. is there evidence to suggest that the life of the light bulbs is less than 2.6975 0. An important criterion . If it is significantly too thick it will jam in the pencil. 3.7540 0.6600 0.7100 0. at a 5% significance level.6900 0.6888 0.7598 0.6960 0. It takes a sample of 30 of these leads and measures the diameter with a micrometer gauge.7 mm with a standard deviation of 0. The diameter of the samples is given in the table below.7012 0.7090 0.7660 0.

In this case.86 100.20 101.96 97. the shaft vibrates.00 97.Chapter 8: Hypothesis testing of a single population 283 for the drive shafts is that they should not be below a certain diameter.65 100.01 98.77 98.39 101.64 99.77 99.20 99.13 101.25 100.21 98.01 98. what might you recommend? . Explain your reasoning.99 100.21 99.46 100.78 98.45 102.76 100. how would your conclusions from the Questions 1 to 4 change? 6.99 103. the specification calls for a nominal diameter of the drive shaft of 100 mm. is there evidence that the drive shaft diameter is significantly below 100 mm causing the lot to be rejected? Explain your reasoning.77 98. If this is the case.77 97.47 100.50 100.15 99.78 97.12 98.45 99.00 99. 2.25 98.45 97. is there evidence that the shaft diameter of model MT 2501 is significantly below 100 mm? If so there would be cause to reject the lot. The company took a sample of 120 drive shafts from a large manufactured lot and measured their diameter.77 99.90 97.20 99. In the way that the drive shafts are machined.76 99.46 99.98 100.10 98.78 99.18 99.79 99.56 99.56 100.56 99.00 100.33 98.24 99.21 98.56 100.15 100.78 102.45 101.00 101.98 98.78 99.98 98. 4.87 99.19 100.56 98.45 97. MT 2501.77 101.23 99.11 99.77 100.77 100.98 99.94 99.45 98.76 98.00 101.78 101.27 100.69 101.09 99.55 99.45 101.24 99.44 102.77 98.75 99. If instead of using the whole sample size indicated in the table you used just the data in the first three columns.09 100.23 98. and again making the test using the critical value criteria.78 Required 1.78 101.99 97.99 98. using the critical value method. 3.12 100.56 99.01 100.44 98.00 99.48 101.24 99. there are never problems of the shafts being oversized. If you use the p-value for testing are you able to verify your conclusions in Question 1? Explain your reasoning.23 101.20 99.22 97.77 100. 5.15 98. Using this level.98 102. If you use the p-value for testing are you able to verify your conclusions in Question 3? Explain your reasoning. For one particular model. A particular client of Pumpet insists that a significance level of 10% be used for analysis as they have stricter quality control limits.24 99.76 98.76 102. From your answer to Question 5.99 98.77 99.78 100. Pumpet normally uses a significance level of 5% for its analysis.20 100.45 100. The results were as follows: 100. then when in use.78 101.75 100.98 101.23 98.44 98.87 100.23 100. and eventually breaks.76 100.97 99.45 99.64 100.

In this case banks need to have a reasonable estimate of how much cash to make available in their ATMs. Re-examine Question 1 using the p-value approach.284 Statistics for Business 5. For BNP-Paribas in its branches in the Rhone region in the Southeast of France it estimates that for this 2. then at the 5% significance level does this data indicate that the mean withdrawal from the machines is different from €3. A random sample of the withdrawal from 36 of its branches indicate a sample average withdrawal of €3. Using the concept of critical values.5 cm.5 days from Saturday afternoon to Tuesday morning. In a production lot of legs for bar stools the quality . Automatic teller machines (ATMs) Situation Banks in France are closed for 2. Here we have used the test for a difference. What are the confidence limits at 1% significance? How do these values corroborate your answers to Questions 4 and 5? 7.5-day period the demand from its customers from those branches with a single ATM machine is €3. If the length is more than 70 cm they can be shaved down to the required length. Required 1.235. The specifications require that the length of the legs of the bar stools is 70 cm. Are your conclusions the same? Explain your conclusions? 3. either left or right hand? 6.200? 5. What are the confidence limits at 5% significance? How do these values corroborate your answers to Questions 1 and 2? 4.200? 2. In the production process of the bar stools the pieces are cut before shaping and assembling. Re-examine Question 4 using the p-value approach. In the production of the legs for the bar stools it is known that the standard deviation of the process is 2. Are your conclusions the same? Explain your conclusions. However if pieces are significantly less than 70 cm they cannot be used for bar stools and are sent to another production area where they are re-cut to use in the assembly of standard chair legs. 6. Why is the bank interested in the difference rather than a one-tail test. Bar stools Situation A supplier firm to IKEA makes wooden bar stools of various styles.200 with a population standard deviation of €105. Using the concept of critical values then at the 1% significance level does this data indicate that the mean life of the population of the batteries is different from €3.

002. 5. does your answer corroborate the conclusion of Question 3? Give your reasoning. indicates on the label that the nominal volume of the salad dressing is 1.2 999. we used the Student-t distribution.000.0 1.001.00 ml. 3. does this sample data indicate that the length of the legs is less than 70 cm? 2. using the p-value concept. In this case.001.0 998.2 997. using the concept of critical value testing.9 996. At a 10% significance level.0 1.000.002.4 1. The quality control inspector takes a random sample of 25 of the bottles from the production line and measures their volumes.000.1 994.2 1.005.0 1. In the filling process the firm knows that the standard deviation is 5.Chapter 8: Hypothesis testing of a single population 285 control inspector takes a random sample and the length of these is according to the following table.0 1.9 1. does this sample data indicate that the length of the legs is less than 70 cm? 4. Salad dressing Situation Amora salad dressing is made in Dijon in France. At the 10% significance level.001.3 995.8 992. Since we know the standard deviation we are correct to use the normal distribution for this hypothesis test. 993. does this sample data indicate that the volume of salad dressing in the bottles is different than the volume indicated on the label? . does your answer corroborate the conclusion of Question 1? Give your reasoning.000 ml. At a 5% significance level.005. At the 5% significance level. made with wine.0 1. At the 5% significance level.0 996.7 995.0 Required 1. using the concept of critical value testing.9 994. 65 71 67 69 74 75 70 69 69 74 68 69 68 68 68 67 68 67 68 72 72 69 71 66 72 67 67 67 73 68 70 72 Required 1. using the concept of critical value testing. which are given in the following table.9 1. One of their products. Assume that we did not know the process standard deviation and as the sample size of 32 is close to a cut-off point of 30.5 993. using the p-value concept. would our analysis change the conclusions of Questions 1to 4? 7.7 1.2 1.0 992.4 997.002.

286

Statistics for Business

2. At the 5% significance level, using the p-value concept, does your answer corroborate the conclusion of Question 1? Give your reasoning. 3. At the 5% significance level, what are the confidence intervals when the test is asking for a difference in the volume? How do these intervals confirm your answers to Questions 1 and 2? 4. At the 5% significance level, using the concept of critical value testing, does this data indicate that the volume of salad dressing in the bottles is less than the volume indicated on the label? 5. At the 5% significance level, using the p-value concept, does your answer corroborate the conclusion of Question 4? Give your reasoning. 6. Why is the test mentioned in Question 4 important? 7. What can you say about the sensitivity of this sampling experiment?

8. Apples

Situation

In an effort to reduce obesity among children, a firm that has many vending machines in schools is replacing chocolate bars with apples in its machines. Unlike chocolate bars that are processed and thus the average weight is easy to control, apples vary enormously in weight. The vending firm asks its supplier of apples to sort them before they are delivered as it wants the average weight to be 200 g. The criterion for this is that the vending firm wants to be reasonably sure that each child who purchases an apple is getting one of equivalent weight. A truck load of apples arrives at the vendor’s depot and an inspector takes a random sample of 25 apples. The following is the weight of each apple in the sample.

198 199 207 195 199 201 208 195 190 205 202 196 187 195 203 186 196 199 197 190 199 196 189 209 199

Required

1. At the 5% significance level, using the concept of critical value testing, does this sample data indicate that the weight of the truck load of apples is different than the desired 200 g? 2. At the 5% significance level, using the p-value concept, does your answer corroborate the conclusion of Question 1? Give your reasoning. 3. At the 5% significance level what are the confidence intervals when the test is asking for a difference in the volume. How do these intervals confirm your answers to Questions 1 and 2? 4. At the 5% significance level, using the concept of critical value testing, does this sample data indicate that the weight of the truck load of apples is less than the desired 200 g? 5. At the 5% significance level, using the p-value concept, does your answer corroborate the conclusion of Question 4? Give your reasoning.

Chapter 8: Hypothesis testing of a single population

287

9. Batteries

Situation

A supplier of batteries claimed that for a certain type of battery the average life was 500 hours. The quality control inspector of a potential buying company took a random sample of 15 of these batteries from a lot and tested them until they died. The life of these batteries in hours is given in the table.

350 485 489

925 546 568

796 551 685

689 512 578

501 589 398

Required

1. Using the concept of critical values, then at the 5% significance level does this data indicate that the mean life of the population of the batteries is different from the hypothesized value? 2. Re-examine Question 1 using the p-value approach. Are your conclusions the same? Explain your reasoning? 3. Using the concept of critical values then at the 5% significance level does this data indicate that the mean life of the population of the batteries is greater than the hypothesized value? 4. Re-examine Question 3 using the p-value approach. Are your conclusions the same? Explain your reasoning? 5. Explain the rationale for the differences in the answers to Questions 1 and 3, and the differences in the answers to Questions 2 and 4.

**10. Hospital emergency
**

Situation

A hospital emergency service must respond rapidly to sick or injured patients in order to increase rate of survival. A certain city hospital has an objective that as soon as it receives an emergency call an ambulance is on the scene within 10 minutes. The regional director wanted to see if the hospital objectives were being met. Thus during a weekend (the busiest time for hospital emergencies) a random sample of the time taken to respond to emergency calls were taken and this information, in minutes, is in the table below.

8 12 9

14 7 17

15 8 22

20 21 10

7 13 9

288

Statistics for Business

Required

1. At the 5% significance level, using the concept of critical value testing, does this sample data indicate that the response time is different from 10 minutes? 2. At the 5% significance level, using the p-value concept, does your answer corroborate the conclusion of Question 1? Give your reasoning. 3. At the 5% significance level what are the confidence intervals when the test is asking for a difference? How do these intervals confirm your answers to Questions 1 and 2? 4. At the 5% significance level, using the concept of critical value testing, does this data indicate that the response time for an emergency call is greater than 10 minutes? 5. At the 5% significance level, using the p-value concept, does your answer corroborate the conclusion of Question 4? Give your reasoning. 6. Which of these two tests is the most important?

**11. Equality for women
**

Situation

According to Jenny Watson, the commission’s chair of the United Kingdom Sex Discrimination Act (SDA), there continues to be an unacceptable pay gap of 45% between male and female full time workers in the private sector.1 A sample of 72 women is taken and of these 22 had salaries less than their male counterparts for the same type of work.

Required

1. Using the critical value approach for a 1% significance level, is there evidence to suggest that the salaries of women is different than the announced amount of 45%? 2. Using the p-value approach are you able to corroborate your conclusions from Question 1. Explain your reasoning. 3. What are the confidence limits at the 1% level? How do they agree with your conclusions of Questions 1 and 2? 4. Using the critical value approach for a 5% significance level, is there evidence to suggest that the salaries of women is different than the announced amount of 45%? 5. Using the p-value approach are you able to corroborate your conclusions from Question 3. Explain your reasoning. 6. What are the confidence limits at the 5% level? How do they agree with your conclusions of Questions 1 and 2? 7. How would you interpret these results?

**12. Gas from Russia
**

Situation

Europe is very dependent on natural gas supplies from Russia. In January 2006, after a bitter dispute with Ukraine, Russia cut off gas supplies to Ukraine but this also affected other

1

Overell, S., Act One in the play for equality, Financial Times, 5 January 2006, p. 6.

Chapter 8: Hypothesis testing of a single population

289

European countries’ gas supplies. This event jolted European countries to take a re-look at their energy policies. Based on 2004 data the quantity of imported natural gas of some major European importers and the amount from Russia in billions of cubic metres was according to the table below.2 The amounts from Russia were on a contractual basis and did not necessarily correspond to physical flows.

Country Germany Italy Turkey France Hungary Poland Slovakia Czech Republic Austria Finland Total imports (m3 billions) 91.76 61.40 17.91 37.05 10.95 9.10 7.30 9.80 7.80 4.61 Imports from Russia (m3 billions) 37.74 21.00 14.35 11.50 9.32 7.90 7.30 7.18 6.00 4.61

Industrial users have gas flow monitors at the inlet to their facilities according to the source of the natural gas. Samples from 35 industrial users were taken from both Italy and Poland and of these 7 industrial users in Italy and 31 in Poland were using gas imported from Russia.

Required

1. Using the critical value approach at a 5% significance level, is there evidence to suggest that the proportion of natural gas Italy imports from Russia is different than the amount indicated in the table? 2. Using the p-value approach are you able to corroborate your conclusions from Question 1. Explain your reasoning. 3. What are the confidence limits at the 5% level? How do they agree with your conclusions of Questions 1 and 2? 4. Using the critical value approach for a 10% significance level, is there evidence to suggest that the proportion of natural gas Italy imports from Russia is different than the amount indicated in the table? 5. Using the p-value approach are you able to corroborate your conclusions from Question 4. Explain your reasoning. 6. What are the confidence limits at the 10% level? How do they agree with your conclusions of Questions 4 and 5?

White, G.L., “Russia blinks in gas fight as crisis rattles Europe”, The Wall Street Journal, 3 January 2005, pp. 1–10.

2

290

Statistics for Business

7. Using the critical value approach at a 5% significance level, is there evidence to suggest that the proportion of natural gas Poland imports from Russia is different than the amount indicated in the table? 8. Using the p-value approach are you able to corroborate your conclusions from Question 7. Explain your reasoning. 9. What are the confidence limits at the 5% level? How do they agree with your conclusions of Questions 7 and 8? 10. Using the critical value approach for a 10% significance level, is there evidence to suggest that the proportion of natural gas Poland imports from Russia is different than the amount indicated in the table? 11. Using the p-value approach are you able to corroborate your conclusions from Question 10. Explain your reasoning. 12. What are the confidence limits at the 10% level? How do they agree with your conclusions of Questions 10 and 11? 13. How would you interpret these results of all these questions?

**13. International education
**

Situation

Foreign students are most visible in Australian and Swiss universities, where they make up more than 17% of all students. Although the United States attracts more than a quarter of the world’s foreign students, they account for only some 3.5% of America’s student population. Almost half of all foreign students come from Asia, particularly China and India. Social sciences, business, and law are the fields of study most popular with overseas scholars. The table below gives information for selected countries for 2003.3

Country Australia Austria Belgium Britain Czech Republic Denmark France Germany Greece Hungary Ireland Italy Foreign students as % of total 19.0 13.5 11.5 11.5 4.5 9.0 10.0 10.5 2.0 3.0 6.0 2.0 Country Japan Netherlands New Zealand Norway Portugal South Korea Spain Sweden Switzerland Turkey United States Foreign students as % of total 2.0 4.0 13.5 5.5 4.0 0.5 3.0 8.0 18.0 1.0 3.5

3

Economic and financial indicators, The Economist, 17 September 2005, p. 108.

Chapter 8: Hypothesis testing of a single population

291

Random samples of 45 students were selected in Australia and in Britain. Of those in Australia, 14 were foreign, and 10 of those in Britain were foreign.

Required

1. Using the critical value approach, at a 1% significance level, is there evidence to suggest that the proportion of foreign students in Australia is different from that indicated in the table? 2. Using the p-value approach are you able to corroborate your conclusions from Question 1. Explain your reasoning. 3. What are the confidence limits at the 1% level? How do they agree with your conclusions of Questions 1 and 2? 4. Using the critical value approach, at a 5% significance level, is there evidence to suggest that the proportion of foreign students in Australia is different from that indicated in the table? 5. Using the p-value approach are you able to corroborate your conclusions from Question 4. Explain your reasoning. 6. What are the confidence limits at the 5% level? How do they agree with your conclusions of Questions 1 and 2? 7. Using the critical value approach, at a 1% significance level, is there evidence to suggest that the proportion of foreign students in Britain is different from that indicated in the table? 8. Using the p-value approach are you able to corroborate your conclusions from Question 7. Explain your reasoning. 9. What are the confidence limits at the 1% level? How do they agree with your conclusions of Questions 7 and 8? 10. Using the critical value approach, at a 5% significance level, is there evidence to suggest that the proportion of foreign students in Britain is different from that indicated in the table? 11. Using the p-value approach are you able to corroborate your conclusions from Question 10. Explain your reasoning. 12. What are the confidence limits at the 5% level? How do they agree with your conclusions of Questions 10 and 11?

**14. United States employment
**

Situation

According to the United States labour department the jobless rate in the United States fell to 4.9% at the end of 2005. It was reported that 108,000 jobs were created in December and 305,000 in November. Taken together, these new jobs created over the past 2 months allowed the United States to end the year with about 2 million more jobs than it had

Andrews, E.L., “Jobless rate drops to 4.9% in U.S,” International Herald Tribune, 7/8 January 2006, p. 17.

4

292

Statistics for Business

12 months ago.4 Random samples of 83 people were taken in both Palo Alto, California and Detroit, Michigan. Of those from Palo Alto, 4 said they were unemployed and 8 in Detroit said they were unemployed.

Required

1. Using the critical value approach, at a 5% significance level, is there evidence to suggest that the unemployment rate in Palo Alto is different from the national unemployment rate? 2. Using the p-value approach are you able to corroborate your conclusions from Question 1. Explain your reasoning. 3. What are the confidence limits at the 5% level? How do they agree with your conclusions of Questions 1 and 2? 4. Using the critical value approach, at a 10% significance level, is there evidence to suggest that the unemployment rate in Palo Alto is different from the national unemployment rate? 5. Using the p-value approach are you able to corroborate your conclusions from Question 4. Explain your reasoning. 6. What are the confidence limits at the 10% level? How do they agree with your conclusions of Questions 4 and 5? 7. Using the critical value approach, at a 5% significance level, is there evidence to suggest that the unemployment rate in Detroit is different from the national unemployment rate? 8. Using the p-value approach are you able to corroborate your conclusions from Question 7. Explain your reasoning. 9. What are the confidence limits at the 5% level? How do they agree with your conclusions of Questions 7 and 8? 10. Using the critical value approach, at a 10% significance level, is there evidence to suggest that the unemployment rate in Detroit is different from the national unemployment rate? 11. Using the p-value approach are you able to corroborate your conclusions from Question 10. Explain your reasoning. 12. What are the confidence limits at the 10% level? How do they agree with your conclusions of Questions 10 and 11? 13. Explain your results for Palo Alto and Detroit.

**15. Mexico and the United States
**

Situation

On 30 December 2005 a United States border patrol agent shot dead an 18-year-old Mexican as he tried to cross the border near San Diego, California. The patrol said the shooting was in self-defence and that the dead man was a coyote, or people smuggler. In

Chapter 8: Hypothesis testing of a single population

293

2005, out of an estimated 400,000 Mexicans who crossed illegally into the United States, more than 400 died in the attempt. Illegal immigration into the United States has long been a problem and to control the movement there are plans to construct a fence along more than a third of the 3,100 km border. According to data for 2004, there are some 10.5 million Mexicans in the United States, which represents some 31% of the foreign-born United States population. The recorded Mexicans in the United States of America is equivalent to 9% of Mexico’s total population. In addition, it is estimated that there are some 10 million undocumented immigrants in the United States of which 60% are considered to be Mexican.5 A random sample of 57 foreign-born people were taken in the United States and of these 11 said they were Mexican and of those 11, two said they were illegal.

Required

1. What is the probability that a Mexican who is considering to cross the United States border will die or be killed in the attempt? 2. Using the critical value approach, at a 5% significance level, is there evidence to suggest that the proportion of Mexicans, as foreign-born people, living in the United States is different from the indicated data? 3. Using the p-value approach are you able to corroborate your conclusions from Question 2. Explain your reasoning. 4. What are the confidence limits at the 5% level? How do they agree with your conclusions of Questions 1 and 2? 5. Using the critical value approach, at a 10% significance level, is there evidence to suggest that the proportion of Mexicans, as foreign-born people, living in the United States is different from the indicated data? 6. Using the p-value approach are you able to corroborate your conclusions from Question 5. Explain your reasoning. 7. What are the confidence limits at the 10% level? How do they agree with your conclusions of Questions 5 and 6? 8. Using the critical value approach, at a 5% significance level, is there evidence to suggest that the number of undocumented Mexicans living in the United States is different from the indicated data? 9. Using the p-value approach are you able to corroborate your conclusions from Question 8. Explain your reasoning. 10. What are the confidence limits at the 5% level? How do they agree with your conclusions of Questions 8 and 9? 11. Using the critical value approach, at a 10% significance level, is there evidence to suggest that the number of undocumented Mexicans living in the United States is different from the indicated data? 12. Using the p-value approach are you able to corroborate your conclusions from Question 11. Explain your reasoning.

5

“Shots across the border,” The Economist, 14 January 2006, p. 53.

294

Statistics for Business

13. What are the confidence limits at the 10% level? How do they agree with your conclusions of Questions 11 and 12? 14. What are your comments about the difficulty in carrying out this hypothesis test?

**16. Case: Socrates and Erasmus
**

Situation

The Socrates II European programme supports cooperation in education in eight areas, from school to higher education, from new technologies, to adult learners. Within Socrates II is the programme Erasmus that was established in 1987 with the objective to facilitate the mobility of higher education students within European universities. The programme is named after the philosopher, theologian, and humanist, Erasmus of Rotterdam (1465–1536). Erasmus lived and worked in several parts of Europe in quest of knowledge and experience believing such contacts with different cultures could only furnish a broad knowledge. He left his fortune to the University of Basel and became a precursor of mobility grants. The Erasmus programme has 31 participating countries that include the 25 member states of the European Union, the three European Economic area countries of Iceland, Liechtenstein, and Norway, and the current three candidate countries – Romania, Bulgaria, and Turkey. The programme is open to universities for all higher education programmes including doctoral courses. In between the academic years 1987–1988 to 2003–2004 more than 1 million university students had spent an Erasmus period abroad and there are 2,199 higher education institutions participating in the programme. The European Union budget for 2000–2006 is €950 million of which about is €750 million is for student grants. In the academic year 2003–2004, the Erasmus students according to their country of origin and their country of study, or host country is given in the cross-classification Table 1 and the field of study for these students according to their home country is given in Table 2. It is the target of the Erasmus programme to have a balance in the gender mix and the programme administrators felt that the profile for subsequent academic years would be similar to the profile for the academic year 2003–2004.6

Required

A sample of random data for the Erasmus programme for the academic year 2005–2006 was provided by the registrar’s office and this is given in Table 3. Does this information bear out the programme administrator’s beliefs if this is tested at the 1%, 5%, and 10% significance level for a difference?

6

http://europa.eu.int:comm/eduation/programmes/socrates/erasmus/what-en.html

Table 1

Subject

**Students by field of study 2003–2004 according to home country.
**

AT BE BG CY 0 0 0 7 24 3 0 0 15 0 0 12 0 3 0 CZ 187 168 182 584 228 481 90 148 464 185 123 222 113 309 14 DK 18 54 60 364 74 112 27 141 346 103 20 115 33 171 44 EE FI FR 398 519 651 6,573 320 2,833 259 598 3,321 1,449 570 399 843 1,787 295 DE 181 762 906 5,023 535 1,376 433 1,048 3,528 1,474 803 1,021 879 2,067 425 GR 81 149 143 306 81 143 46 131 327 191 104 172 87 343 38 HU 136 75 114 450 126 147 66 64 248 159 64 125 29 200 23 IS 3 0 24 56 22 20 3 13 47 7 4 4 3 15 0 IE 3 30 90 593 24 52 12 51 305 142 45 46 62 210 32 IT 317 877 756 1,963 267 1,545 206 1,144 3,346 1,455 392 1,045 453 2,220 723 LV 14 9 31 88 27 10 14 13 21 7 13 8 6 38 5

Agricultural sciences 37 156 51 Architecture, Planning 128 163 32 Art and design 193 209 42 Business studies 1,117 1,089 97 Education, Teacher training 260 414 12 Engineering, Technology 248 384 133 Geography, Geology 32 28 12 Humanities 147 105 14 Languages, Philological sciences 505 603 73 Law 231 357 37 Mathematics, Informatics 146 139 86 Medical sciences 144 349 60 Natural sciences 143 51 33 Social sciences 250 500 48 Communication and information 112 212 19 science Other areas 28 30 2 Total 3,721 4,789 751

6 64 12 30 47 326 47 1,383 2 100 22 487 9 33 9 136 51 316 28 117 4 108 12 291 4 93 32 307 12 100

Chapter 8: Hypothesis testing of a single population

0 91 4 8 60 166 227 43 32 0 8 120 4 64 3,589 1,686 305 3,951 20,981 20,688 2,385 2,058 221 1,705 16,829 308

295

Philological sciences 0 92 14 7 253 84 675 334 451 84 97 2.070 5.089 275 1.276 3.388 1.211 176 232 0 Natural sciences 0 43 7 4 51 22 361 216 206 29 2 1.482 135.296 Statistics for Business Table 1 Subject (Continued).602 4. Informatics 0 65 0 1 55 35 301 87 176 23 3 674 46 92 0 Medical sciences 0 85 8 32 219 142 247 407 209 71 6 1.138 29.156 6.109 424 269 0 Geography.413 195 754 1 Mathematics. Teacher training 0 56 43 11 354 92 126 215 47 15 17 602 69 163 0 Engineering.667 7.539 10 .005 682 546 20.062 84 220 0 Social sciences 0 97 19 5 992 137 928 487 355 29 65 1.244 902 1.589 1.314 2.568 121 2. Geology 0 25 8 2 84 5 158 66 147 10 6 450 31 88 0 Humanities 0 33 2 1 81 39 171 60 116 22 12 654 48 206 8 Languages.782 3.893 6. LI LT LU MT NL NO PL PT RO SK SI ES SE UK EUI Total 2.171 9.875 0 Law 0 87 6 31 303 77 429 190 98 25 51 1.139 14.332 0 Education.194 138 119 4.586 Agricultural sciences 0 48 0 0 80 27 112 69 61 37 23 566 19 23 0 Architecture.215 21.717 4.350 5.342 386 290 169 146 3.179 7.034 2.326 14.214 3.187 4.701 313 585 1 Communication and information 0 17 1 5 264 10 68 155 54 3 19 800 56 83 0 science Other areas 0 16 1 0 85 11 53 162 40 7 2 221 29 32 0 Total 19 1. Technology 0 189 6 9 224 112 752 479 604 106 35 3. Planning 9 37 4 2 109 19 321 264 64 18 24 854 64 96 0 Art and design 0 63 4 3 145 69 232 205 87 34 38 905 90 489 0 Business studies 10 241 15 6 1.

743 LV 5 4 Home Country Austria Belgium Bulgaria Cyprus Czech Republic Denmark Estonia Finland France Germany Greece Hungary Iceland Ireland Italy Latvia Liechtenstein Lithuania Luxembourg Malta Netherlands Norway Poland Portugal Rumania Slovakia Slovenia Spain Sweden United Kingdom EUI* Total 2 35 21 25 1 3 162 169 171 20 12 14 23 47 2 9 3 23 1 Chapter 8: Hypothesis testing of a single population 6 8 1 7 26 86 2 28 5 129 29 4 0 0 1 0 8 0 0 0 0 8 2 0 44 0 103 0 0 7 0 3 0 6 0 5 11 0 5 90 0 0 4 62 169 38 107 1.127 1 16.932 FR 528 768 136 9 510 260 42 413 3.161 BE 79 46 0 134 44 10 148 420 330 140 98 4 47 633 27 0 70 1 5 184 29 358 250 163 50 30 1.Table 2 Erasmus students 2003–2007 by home country and host country. Code AT BE BG CY CZ DK EE FI FR DE GR HU IS IE IT LV LI LT LU MT NL NO PL PT RO SK SI ES SE UK EUR AT 105 52 1 211 70 16 229 361 387 71 110 10 35 339 8 0 49 17 4 98 50 159 53 38 44 59 298 142 143 2 3.593 49 0 59 0 11 0 4 HU 30 28 IS 15 3 IE 132 121 6 0 43 36 2 111 1.553 426 1.997 420 276 26 557 2.396 CY 5 1 CZ 51 51 DK 104 84 14 2 103 EE 7 5 FI 227 218 16 14 241 5 47 727 918 116 201 1 40 367 42 3 180 1 6 275 15 310 95 33 52 24 501 24 233 4.513 BG 3 11 0 2 5 9 17 6 9 10 7 8 19 126 206 207 63 19 37 500 410 45 44 54 30 357 13 2 145 2 2 158 53 362 63 29 11 19 573 25 136 3.874 GR 30 75 62 13 78 13 6 72 218 165 42 3 12 180 2 18 0 0 42 15 122 53 87 24 6 178 17 60 1.804 356 566 40 292 1.870 295 457 191 125 2.125 80 62 3.994 111 1 294 39 6 391 190 1.081 926 27 15 2 230 2 1 10 0 6 88 17 74 19 21 2 1 513 80 21 3.054 42 117 4.587 IT 461 467 39 3 180 111 26 190 1.859 18 77 27 3 543 156 855 325 1.275 DE 262 306 227 4 931 302 59 654 2.298 12 10 8 166 67 28 31 951 21 9 9 199 1 3 1 65 297 .550 1.412 484 2.755 248 227 16 109 9 67 9 52 256 85 481 713 448 58 56 4.303 4 20.250 137 740 12.

459 59 3 10 536 32 0 16 181 22 6 6 201 3.276 3.298 Statistics for Business Table 2 (Continued).082 UK 410 341 44 8 317 330 8 552 4.686 305 3.076 SE 305 149 9 5 163 30 26 101 1. .705 16.829 308 19 1.636 24.721 4.782 3.058 221 1.511 7 5 22 16 22 635 159 337 178 86 32 29 2.667 7.156 6.733 NO 82 40 PL 22 69 PT 60 207 34 2 189 15 4 58 288 283 90 42 1 18 766 4 2 51 6 2 93 36 222 119 30 30 992 25 97 3.115 4.388 1.688 9 61 14 3 907 231 546 920 285 59 63 370 1.951 20.981 20.653 109 58 2 57 399 32 1 120 3 1 389 42 286 95 42 17 17 670 238 1 6.766 RO 8 30 SK 6 10 SI 16 9 ES 631 1.539 2 10 16.159 139 109 13 37 1.974 494 Total Home Country Austria Belgium Bulgaria Cyprus Czech Republic Denmark Estonia Finland France Germany Greece Hungary Iceland Ireland Italy Latvia Liechtenstein Lithuania Luxembourg Malta Netherlands Norway Poland Portugal Rumania Slovakia Slovenia Spain Sweden United Kingdom EUI* Total 3 15 25 49 1 4 16 43 28 5 27 15 246 463 17 12 60 314 395 14 5 13 167 27 3 22 30 26 0 5 29 40 24 2 8 6 1 1 1 4 28 5 71 8 156 10 174 129 29 3 20 0 0 0 1 0 10 0 26 0 0 0 3 0 18 0 140 1 21 0 125 0 14 0 68 0 3 0 7 0 5 0 14 4 38 0 0 0 11 24 11 3 218 0 0 0 14 9 11 12 253 200 22 69 1.385 2. Code AT BE BG CY CZ DK EE FI FR DE GR HU IS IE IT LV LI LT LU MT NL NO PL PT RO SK SI ES SE UK EUR LI 1 0 LT 12 7 LU 0 3 MT 14 13 NL 215 377 23 203 117 10 377 891 862 106 145 13 110 607 24 4 30 0 7 78 294 250 72 29 25 1.523 176 24 42 1.688 2.589 1.652 3.005 682 546 20.194 138 119 4. Florence.062 1.287 43 3 286 259 30 479 5.325 374 125 36 291 5.628 135.789 751 64 3.263 236 365 6.034 2.586 * European University Institute.

Philological sciences Business studies Business studies Gender M M F F M M M F M M M F F M F F M M F M F F F F M M F M F F F M F F Family name Algard Alinei Andersen Bay Bednarczyk Berberich Berculo Engler Ernst Fouche Garcia Guenin Johannessen Justnes Kauffeldt Keddie Lorenz Mallet Manzo Margineanu Miechowka Mynborg Napolitano Neilson Ou Rachbauer Savreux Seda Semoradova Torres Ungerstedt Ververken Viscardi Zawisza . Technology Humanities Architecture. Informatics Business studies Business studies Business studies Agricultural sciences Engineering.Chapter 8: Hypothesis testing of a single population 299 Table 3 Sample of Erasmus student enrollments for the academic year 2005–2006. Technology Social sciences Law Engineering. Teacher training Engineering. Geology Business studies Education. Teacher training Communication and information science Humanities Business studies Languages. First name Erik Gratian Birgitte Brix Hilde Tomasz Rémi Ruwan Dorothea Folker Elie Miguel Aurélie Sanne Lyng Petter Ane Katrine Nikki Jan Sebastian Guillaume Margherita Florin Anne Sophie Astrid Silvia Alison Kalvin Thomas Margaux Jiri Petra Maria Teresa Malin Alexander Alessandra Katarzyna Home country Norway Rumania Denmark Norway Poland Germany Netherlands Germany Germany France Spain France Denmark Norway Denmark United Kingdom Germany France Italy Rumania France Denmark Italy United Kingdom France Austria France Czech Republic Czech Republic Spain Sweden Belgium Italy Poland Study area Business studies Business studies Engineering. Technology Mathematics. Planning Business studies Education. Philological sciences Business studies Mathematics. Technology Business studies Geography. Informatics Agricultural sciences Natural sciences Humanities Law Languages.

This page intentionally left blank .

women earn 15% less than men for the same work. . the European Commission brought out its own report on the pay gap across the whole European Union. According to the WWC report the gender pay gap opens early. Boys and girls study different subjects in school. They then work in different sorts of jobs. the difference in median pay between men and women is around 20%.1 How do we compile this type of statistical information? We can use hypothesis testing for more than one type of population – the subject of this chapter. The Economist. As a result. Also in February. According to the report. average hourly pay for a woman at the start of her working life is only 91% of a man’s. even though nowadays she is probably better qualified. 1 “Women’s pay: The hand that rocks the cradle”.Hypothesis testing for different populations 9 Women still earn less than men On 27 February 2006 the Women and Work Commission (WWC) published its report on the causes of the “gender pay gap” or the difference between men’s and women’s hourly pay. Its findings were similar in that on an hourly basis. p. 4 March 2006. In the United States. British women in full-time work currently earn 17% less per hour than men. 33. and boy’s subjects lead to more lucrative careers.

A company wants to know if there is a significant difference in the productivity of the employees in one country and another country. are they essentially the same. That is. A company wishes to know if sales volume of a certain product in one store is different from another store in a different location. as for example the following: ● ● ● salaries of men and the salaries of women in his multinational firm. we presented by sampling from a single population. A human resource manager wants to know if there is a significant difference between the In these cases we are not necessarily interested in the specific value of a population parameter but more to understand something about the relation between the two parameters from the populations. or is there a significant difference? . The subtopics of these themes are as follows: ✔ ✔ ✔ ✔ Difference between the mean of two independent populations • Difference of the means for large samples • The test statistic for large samples • Application of the differences in large samples: Wages of men and women • Testing the difference of the means for small sample sizes • Application of the differences in small samples: Production output Differences of the means between dependent or paired populations • Application of the differences of the means between dependent samples: Health spa Difference between the proportions of two populations with large samples • Standard error of the difference between two proportions • Application of the differences of the proportions between two populations: Commuting Chi-square test for dependency • Contingency table and chi-square application: Work schedule preference • Chi-square distribution • Degrees of freedom • Chi-square distribution as a test of independence • Determining the value of chi-square • Excel and chi-square functions • Testing the chi-square hypothesis for work preference • Using the p-value approach for the hypothesis test • Changing the significance level In Chapter 8.302 Statistics for Business Learning objectives After you have studied this chapter you will understand how to extend hypothesis testing for two populations and to use the chi-square hypothesis test for more than two populations. how we could test a hypothesis or an assumption about the parameter of this single population. A firm wishes to know if there is a difference in the absentee rate of employees in the morning shift and the night shift. In this chapter we look at hypothesis testing when there is more than one population involved in the analysis. A professor of Business Statistics is interested to know if there is a significant difference between the grade level of students in her morning class and in a similar class in the afternoon. ● ● Difference Between the Mean of Two Independent Populations The difference between the mean of two independent populations is a hypothesis test to sample in order to see if there is a significant difference between the parameters of two independent populations.

When the – – value of x 1 is less than x 2 then the result of equation 9(i) value is negative. x1 x2 9(i) – – When the value of x 1 is greater than x 2 then the result of equation 9(i) is positive. The figure on the left gives the normal distribution for Population No. Distribution of population No. Underneath the respective distributions are the sampling distributions of the means taken from that population.Chapter 9: Hypothesis testing for different populations 303 Figure 9. 2 mx 1 m1 mx 2 m2 Difference of the means for large samples The hypothesis testing concept between two population means is illustrated in Figure 9. The difference between the values of the sample means is then given as.2.1 Two independent populations.1. for example. that we take a random sample from – Population 1. μx 1 x2 μx 1 μx 2 9(ii) When the mean of the two populations are equal – – then μx 1 μx 2 0. 2 Mean m 2 Standard deviation s1 s2 f(x) Sampling distribution from population No. The mean of the sample distribution of the differences of the mean is written as. 1 Sampling distribution from population No. Similarly we take a random sample from Popula– tion 2 and this gives a sample mean of x 2. Assume. 1 and the figure on the right gives the normal distribution for Population No. 1 f(x) Mean m 1 Standard deviation Distribution of population No. . 2. which is then the difference between the values of sample means taken from the respective populations. If we construct a distribution of the difference of the entire sample means then we will obtain a sampling distribution of the differences of all the possible sample means as shown in Figure 9. From the data another distribution can be constructed. which gives a sample mean of x 1.

This relationship is also the standard error of the difference between two means. the standard deviation of the distribution of the difference between the sample means. (x1 x2 ) (μ1 ˆ2 σ1 n1 ˆ2 σ2 n2 μ2 )H 1 x2 9(iii) z 0 9(vi) where σ2 and σ2 are respectively the variance of 1 2 Population 1 and Population 2. when we have just one population. then we use the sample standard deviation. the test statistic z for large samples. using the central limit theory we developed the following relationship for the standard error of the sample mean: σx σx n When we test the difference between the means of two populations then the equation for the test statistic becomes.2 Distribution of all possible values of difference between two means. is given by the relationship. (x1 x2 ) (μ1 2 σ1 2 σ2 6(ii) z μ2 )H 0 9(v) Extending this relationship for sampling from two populations. as an estimate – – In this equation. then equation 9(v) becomes. is determined from the following relationship: σx 2 σ1 n1 2 σ2 n2 n1 n2 Alternatively. Standard error of difference x 1 x 2 Distribution of all possible values of X1 X2 1 x2 The test statistic for large samples From Chapter 6. If we do not know the population standard deviations.2. that is greater than 30. The following application example illustrates this concept. In this case the estimated standard deviation of the distribution of the difference between the sample means is. ˆ σx ˆ2 σ1 n1 ˆ2 σ2 n2 9(iv) Figure 9. if we do not know the population standard deviation.304 Statistics for Business ^ of the population standard deviation σ. . s. as given in Figure 9. (x 1 x 2) is the difference between the sample means taken from the population and (μ1 μ2)H0 is the difference of the hypothesized means of the population. σ1 and σ2 are the standard deviations and n1 and n2 are the sample sizes taken from these two populations. Mean x1 x2 z x σx μx n 6(iv) From Chapter 6.

1 Difference in the wages of men and women. is there evidence of a difference between the wages of men and women? At a 10% significance level we are asking the question is there a difference. of 1. z 0. In this example the sample value of z 1. women Population 2. s ($) 2.50 and μ1 μ2 0 since the null hypothesis . The null and alternative hypotheses are as follows: ● is that there is no difference between the population means. This is the same conclusion as previously. 1. Since we have only a measure of the sample standard deviation s and not the population standard deviation σ. Testing the difference of the means for small sample sizes When the sample size is small. Using [function NORMSINV] in Excel the critical value of z is 1. This is a two-tail test with 5. At a 10% significance level.50 0. Since 2.3. As discussed in Chapter 8 we can also use the p-value approach to test the hypothesis.8886 and using [function NORMSDIST] gives an area in the tail of 2.40 1. men 28.15 Application of the differences in large samples: Wages of men and women A large firm in the United States wants to know.8886 1. H1: μ1 μ2 is that there is a significant difference in the wages. x 1 x 2 28.2648 Thus.902 140 1 x2 ● Null hypothesis. The representation of this worked example is illustrated in Figure 9. H0: μ1 μ2 is that there is no significant difference in the wages.65 29.95% is less than 5% we reject the null hypothesis. n 130 140 Population 1.1. or test statistic. or less than 30 units.6449.402 130 0.6449 we reject the null hypothesis and conclude that there is evidence to indicate that the wages of women are significantly different from that of men.0% in each of the tails.Chapter 9: Hypothesis testing for different populations 305 Table 9. then to be correct we must use the – – Here. The standard error of the difference between the means is from equation 9(iv): ˆ σx ˆ2 σ1 n1 ˆ2 σ2 n2 2. we use equation 9(vi) to determine the test or sample statistic z: z (x1 x2 ) (μ1 μ2 )H ˆ2 σ1 n1 ˆ2 σ2 n2 0 Since the sample.95%. Sampling the employees gave the information in $US in Table 9.90 Sample size. the relationship between the wages of men and women employed at the firm. Sample mean – ($) x Sample standard deviation.65 29.8886 is less than the critical value of 1.2648 1.15 0. which means to say that values can be greater or less than. Alternative hypothesis.

on the assumption that the two population variances . (n1 1) (n2 1) (n1 n2 2) 9(viii) Figure 9. When we use the Student-t distribution the population standard deviation is unknown. equation 9(vii) becomes as follows: s2 p 2 (n1 1)s1 (n1 1) (n2 (n2 (n1 (n1 2 1)s2 1) 2 1)s2 1) 2 (s1 2 s2 ) (n1 1)s1 s2 (n1 1) 2 2 (n1 1)(s1 s2 ) (n1 1)(1 1) 2 9(xi) 9(vii) Further. s2.6449 5.95% z 1. 1 taken from Population 1 can be pooled. is given by. p s2 p 2 (n1 1)s1 (n1 1) 2 (n2 1)s2 (n2 1) If we take samples of equal size from each of the populations. can be rewritten as. then since n1 n2.00% 0 This is so because we now have two samples and thus two degrees of freedom.306 Statistics for Business are equal. Note. that the denominator in equation 9(vii). ⎛1 ⎜ ⎜ ⎜n ⎜ ⎝ 1 1⎞ ⎟ ⎟ ⎟ n2 ⎟ ⎠ ⎛1 ⎜ ⎜ ⎜n ⎜ ⎝ 1 1⎞ ⎟ ⎟ ⎟ n1 ⎟ ⎠ 2 n1 9(xii) This value of s2 is now the best estimate of the p variance common to both populations σ2. Thus to estimate the standard error of the difference between the two means we use equation 9(iv): ˆ σx ˆ2 σ1 n1 ˆ2 σ2 n2 9(iv) Then by analogy with equation 9(vi) the value of the Student-t distribution is given by.8886 1. or com2 bined. This value of the p pooled estimate s2 is given by the relationship. when there are small samples on the assumption that the population variances are equal. 1 2 or σ2 σ2. σ2. ˆ σx sp 1 n1 1 n2 9(ix) Complete area 1 x2 Student-t distribution. Area to left 2. the relationship in the denominator of equation 9(x) can be rewritten as. This then enables us to use a pooled 1 2 variance such that the sample variance. to give a value s2. Combining equations 9(iv) and 9(vii) the relationship for the estimated standard error of the difference between two sample means. σ2 is equal to the variance of Population 2. with s2. Note that in Chapter 8 when we took one sample of size n in order to use the Student-t distribution we had (n 1) degrees of freedom. t (x1 x2 ) (μ1 ⎛1 s2 ⎜ p⎜ ⎜ ⎜n ⎝ μ2 )H 1⎞ ⎟ ⎟ ⎟ n2 ⎟ ⎠ 0 9(x) 1 1 x2 However.3 Difference in wages between men and women. a difference from the hypothesis testing of large samples is that here we make the assumption that the variance of Population 1.

4548 Sample size. This is then a one-tail test with 1% in the upper tail. night 27. ● Application of the differences in small samples: Production output One part of a car production firm is the assembly line of the automobile engines.45482 (13 1) . s2 p 2 (n1 1)s1 (n1 1) (n2 (n2 2 1)s2 1) (16 1) * 2. n 16 13 Population 1.3. is there evidence that the output of engines on the morning shift is greater than that on the evening shift? At a 1% significance level we are asking the question is there evidence of the output on the morning shift being greater than the output on the night shift. 1. morning Population 2.1250 24. ● The null hypothesis is that there is no difference in output.2 Production output between morning and night shifts. Using [function TINV] gives a critical value of Student-t 2. This information is given in Table 9. evening 15:00–23:00 hours. and 13 days for the night shift. Morning (m) Night (n) 29 22 24 23 28 21 29 25 31 31 27 22 29 28 28 30 26 20 23 22 25 23 28 25 27 26 27 30 23 Table 9. (x1 x2 ) (μ1 ⎛ s2 s2 ⎞ ⎜ 1 2⎟ ⎟ ⎜ ⎟ ⎜ n ⎟ ⎜ ⎠ ⎝ 1 μ2 )H t 0 9(xiii) The use of the Student-t distribution for small samples is illustrated by the following example.3 Production output between morning and night shifts. the firm employs three shifts: morning 07:00–15:00 hours. s 2. From the sample data we have the information given in Table 9.39102 (16 1) 8. – Sample mean x Sample standard deviation. H1: μM μN.4808 (13 1) * 3.2. and the night shift 23:00–07:00 hours.3910 3.4727. In this area of the plant. The manager of the assembly line believes that the production output on the morning shift is greater than that on the night shift. At a 1% significance level.Chapter 9: Hypothesis testing for different populations 307 Table 9.4615 Thus equation 9(x) can be rewritten as. H0: μM μN The alternative hypothesis is that the output on the morning shift is greater than that on the night shift. From equation 9(vii). Before the manager takes any action he first records the output on 16 days for the morning shift.

(x1 x2 ) (μ1 μ2 )H 1⎞ ⎟ ⎟ ⎟ n2 ⎟ ⎠ our conclusion is the same in that we accept the null hypothesis. This is greater than 1.4808⎜ ⎜ ⎜ 16 13 ⎟ ⎟ ⎠ ⎝ 2. the area in the tail for the sample is still 1.4494. How would your conclusions change if a 5% level of significance were used? In this situation nothing happens to the sample or test value of the Student-t which remains at 2. Shaded area 1.4494 is less than the critical value t of 2.7033 we concluded that at a 5% level the production output in the morning shift is significantly greater than that in the night shift.05% 0 t 1.00% Shaded area 5.05%.05%.1250 24.5 Production output between the morning and night shifts.4727 Total area 1. 2. If we use the p-value approach for this hypothesis test then using [function TDIST] in Excel for a one-tail test.4494 1. This new concept is illustrated in Figure 9. This is less than 5.4.4494 2.4 Production output between the morning and night shifts.4727 we conclude that there is no significant difference between the production output in the morning and night shifts.7033 2. However.00% and so . t 0 ⎛1 s2 ⎜ p⎜ ⎜n ⎜ ⎝ 1 27.308 Statistics for Business Figure 9. The concept of this worked example is illustrated in Figure 9.7033.4494 1. now we have 5% in the upper tail and using [function TINV] gives a critical value of Student-t 1.00% 0 t 2. Figure 9. If we use the p-value approach for this hypothesis test then using [function TDIST] in Excel for a one-tail test then the area in the tail for the sample information is 1.00% and so our conclusion is the same that we reject the null hypothesis. Since 2.4615 0 ⎛ 1 1⎞ ⎟ ⎟ 8.05% Area to left of line From equation 9(x) the sample or test value of the Student-t value is.4494 Since the sample test value of t of 2.5.

The weights of all participants in the programme are recorded each time they come to the spa. The test is now very similar to hypothesis testing for a single population since we are making our analysis just on the difference. kg (2) 120 101 95 87 118 97 92 82 132 121 102 87 87 74 92 84 115 109 98 87 109 100 110 101 95 82 Differences of the Means Between Dependent or Paired Populations In the previous section we discussed analysis on populations that were essentially independent of each other.5. We are interested not in the weights before and after but in the difference of the weights and thus we can extend Table 9. or H1: μ 10 kg. At a significance level of 5% all of the area lies in the right-hand tail.4 to give the information in Table 9.4 Health spa and weight loss. is there evidence that the weight loss of participants in this programme is greater than 10 kg? Here the null hypothesis is that the weight loss is not more than 10 kg or H0: μ 10 kg. . kg (1) After. At a 5% significance level. The purpose of these tests is to see if improvements have been achieved as a result of a new action. productivity improvement after an employee training programme. Before. – x (Difference) s σ ˆ 11. Belgium advertises a combined fitness and diet programme Estimated standard error of the mean is σx ˆ/ σ n 4.7692 kg and 4. The alternative hypothesis is that the weight loss is more than 10 kg. In the wage example we chose samples from a population of men and a population of women. The following application illustrates.4. From the table. The analytical procedure is to consider statistical analysis on the difference of the values since there is a direct relationship rather than the values before and after.3999 Application of the differences of the means between dependent samples: Health spa A health spa in the centre of Brussels. The authorities are somewhat sceptical of the advertising claim so they select at random 13 of the regular participants and their recorded weights in kilograms before and after 6 months in the programme are given in Table 9. often in a before and after situation. or sales increases of a certain product after an advertising campaign. where it guarantees that participants who are overweight will lose at least 10 kg in 6 months if they scrupulously follow the course. In the production output example we looked at the population of the night shift and the morning shift. 1. Sometimes in sampling experiments we are interested in the differences of paired samples or those that are dependent or related. Using [function TINV] gives a critical value of Student-t 1.Chapter 9: Hypothesis testing for different populations 309 Table 9.2203.3999/ 13 1. Examples might be weight loss of individuals after a diet programme.7823. When we make this type of analysis we remove the effect of other variables or extraneous factors in our analysis.

Would your conclusions change if you used a 10% significance level? In this case at a significance level of 10% all of the area lies in the right-hand tail and using [function TINV] gives a critical value of Student-t 1. kg (1) After.64% 0 t 1.7.4498 is less than the critical value of t of 1.64%. 11.3562 and thus we reject the null hypothesis and conclude that the publicity for the programme is correct and that the average weight loss is greater than 10 kg.7823 we accept the null hypothesis and conclude that based on our sampling experiment that the weight loss in this programme over a 6-month period is not more than 10 kg. kg 120 101 19 95 87 8 118 97 21 92 82 10 132 121 11 102 87 15 87 74 13 92 84 8 115 109 6 98 87 11 109 100 9 110 101 9 95 82 13 Figure 9. the area in the tail for sample information is 8. This is less than 10.6. Now.7823 8.5 Health spa and weight loss.6 Health spa and weight loss.00% and so our conclusion is the same in that we accept the null hypothesis. The sample or test value of the Student-t remains unchanged at 1.4498 1.4498 10. . Shaded area 5.00% Area to right of line 8. 1.64% 0 t 1. the area in the tail is still 8.64%. Figure 9. If we use the p-value approach for this hypothesis test then using [function TDIST] in Excel for a one-tail test. x t ˆ σ μH n 0 The concept for this is illustrated in Figure 9.7692 1.00% and so our conclusion is the same in that we reject the null hypothesis. 2.310 Statistics for Business Table 9. This concept is illustrated in Figure 9.4498 1.3562 1. This is greater than 5.3562.00% Area to right of line Complete shaded area Sample.7692 10 1.2203 1.4498 Since this sample value of t of 1. If we use the p-value approach for this hypothesis test then using [function TDIST] in Excel for a one-tail test. Before. or test value of Student-t is.2203 1. kg (2) Difference.4498.7 Health spa and weight loss.

22 May 2006.Chapter 9: Hypothesis testing for different populations Again as in all hypotheses testing. Then by analogy with equation 9(iii) for the difference in the standard error for the means we have the equation for the standard error of the difference between two proportions as.2) Is there a significant difference between the percentage effectiveness of one drug and another drug for the same ailment? In these situations we take samples from each of the two groups and test for the percentage difference in the two populations. International Herald Tribune. q1 are respectively the proportion of success and failure and n1 is the sample size taken from the first population and p2. 2 “Compared with Americans. σp p1q1 n1 p2 q2 n2 9(xiv) 311 Difference Between the Proportions of Two Populations with Large Samples There are situations we might be interested to know if there is a significant difference between the proportion or percentage of some criterion of two different populations. q 2 are the values of the proportion of successes and failures taken from the sample. If we do not know the population proportions then the estimated standard error of the difference between two proportions is. p 1. ˆ σp p1q1 n1 p2q2 n2 9(xv) 1 p2 – – – – Here. remember that the conclusions are sensitive to the level of significance used in the test. the value of z for the difference in the hypothesis for two population proportions is. the British are the picture of health”. where n is the sample size. 1 p2 where p1. In Chapter 8 we developed that the number of standard deviations. Application of the differences of the proportions between two populations: Commuting A study was made to see if there was a significance difference between the commuting time . z ( p1 p2 ) ( p1 ˆ σp 1 p2 )H 0 9(xvi) p2 The use of these relationships is illustrated in the following worked example. p z pH σp 0 8(ix) Standard error of the difference between two proportions In Chapter 6 (equation 6(xi)) we developed the following equation for the standard error of the proportion. and q is the population proportion of failures equal to (1 p). p 2. z. p is the population proportion of successes. q2. σ –: p σp pq n p(1 p) n 6(xi) By analogy. q 1. The procedure behind the test work is similar to the testing of the differences in means except rather than looking at the difference in numerical values we have the differences in percentages. For example. is there a significant difference between the percentage output of one firm’s production site and the other? Is there a difference between the health of British people and Americans? (The answer is yes. 7. in hypothesizing for a single population proportion as. p. and n2 are the corresponding values for the second population. according to a study in the Journal of the American Medical Association.

is there evidence to suggest that the proportion of people commuting . and so again we accept the null hypothesis.8 Commuting time.9181 0 0.5894 and ( p1 Figure 9. p2 q2 127/250 1 0.50% From equation 9(xvi) the sample value of z is. so there is 2.9600. At a 5% significance level.4920 This is a two-tail test since we are asking the question is there a difference? ● ● Null hypothesis is that there is no difference or H0: p1 p2 Alternative hypothesis is that there is a difference or. Using [function NORMSINV] gives a critical value of z of 1. The benchmark for commuting time was at least 2 hours per day.4106 302 0.5080 0.312 Statistics for Business of people working in downtown Los Angeles in Southern California and the commuting time of people working in downtown San Francisco in Northern California.50% the critical value. A random sample of 250 people was selected in San Francisco and 127 replied that they had a commute of at least 2 hours.50% in each tail.5894 * 0.9600 we accept the null hypothesis and conclude that there is no significant difference between commuting time in Los Angeles and San Francisco. A random sample of 302 people was selected from Los Angeles and 178 said that they had a daily commute of at least 2 hours. p1 q1 178/302 1 0.9181 1. 2.5894 0.8.75% 2.75% 0 z 1. H1: p1 p2 From equation 9(xv) the estimated standard error of the difference between two proportions is. ˆ σp 1 p2 p1q1 n1 p2q2 n2 0. 1. At a 5% significance level.9181 1.4106 ( p1 p2 p2 ) H 0 Sample proportion of people commuting at least 2 hours to San Francisco is.4920 250 0. z p2 ) ˆ σp 1 0. is there evidence to suggest that the proportion of people commuting Los Angeles is different from that of San Francisco? Sample proportion of people commuting at least 2 hours to Los Angeles is.5050 * 0. Total area to right 2.5894 0.5080) 0. Using [function NORMSDIST] for a sample value z of 1.75%.9181 the area in the upper tail is 2.9600 Area 2. We obtain the same conclusion when we use the p-value for making the hypothesis test. Since 1.0424 This is a two-tail test at 5% significance. This area of 2. This concept is illustrated in Figure 9.0424 1.5080 and (0.

Thus. Now this value is less than the 5% significant value Contingency table and chi-square application: Work schedule preference We have already presented a contingency or cross-classification table. or that the difference is only due to chance. This new situation is illustrated in Figure 9. If it is not significant. z 0 1. If this difference is considered significant then a conclusion may be that location affects the way people behave.9181.9181 1. two proportions. then the difference is just due to chance. that a sample survey on the proportion of people in certain states of the United States who exercise regularly was found to be 51% in California. 313 Figure 9. or alternatively. the area in the upper tail corresponding to a sample test value of 1. for example.9 Commuting time. The sample test value of z remains unchanged at 1. ● ● Null hypothesis is that there is a population not greater or H0: p1 p2 Alternative hypothesis is that a population is greater than or.9181 2. If we have sample data which give proportions from more than two populations then a chi-square test can be used to draw conclusions about the populations. Suppose. assuming a firm is considering marketing a new type of jogging shoe then if there is a significant difference between states. its marketing efforts should be weighted more on the state with a higher level of physical fitness.Chapter 9: Hypothesis testing for different populations and so the conclusion is the same that there is evidence to suggest that the commuting time for those in Los Angeles is greater than for those in San Francisco.9. However. using [function NORMSDIST] the 5% in the upper tail corresponds to a critical z-value of 1. H1: p1 p2 Here we use since less than or equal is not greater than and so thus satisfies the null hypothesis. Using the p-value approach. The chi-square test enables us to decide whether the differences among several sample proportions is significant. 45% in New York.75% Los Angeles is greater than those working in San Francisco? This is a one-tail test since we are asking the question.00% Chi-Square Test for Dependency In testing samples from two different populations we examined the difference between either two means. Since the value of 1. and 29% in South Dakota.6449 Area to right 1.6449 we reject the null hypothesis and conclude that there is statistical evidence that the commuting time for Los Angeles people is significantly greater than for those persons in San Francisco. in Chapter 2.75%. is one population greater than the other? Here all the 5% is in the upper tail. Area 5. This table .6449. The chi-square test will be demonstrated as follows using a situation on work schedule preference. 34% in Ohio.9181 is still 2.

written χ2 where the symbol χ is the Greek letter c. For example. Italy. and 12.314 Statistics for Business Table 9. Since we are dealing with χ2. frequency of occurrence. 5-day/week work schedule and a proposed 10-hour/day. 6.10 gives three chi-square distributions for degrees of freedom.6 Preference Work preference sample data or observed frequencies. These sample values are the observed frequencies of occurrence. The x-axis is the value of chi-square. and 10.6. the columns give preference according to country and the rows give the preference according to the work schedule criteria. Degrees of freedom The degrees of freedom in a cross-classification table are calculated by the relationship.200 8 hours/day 10 hours/day Total presents data by cross-classifying variables according to certain criteria of interest such that the cross-classification accounts for all contingencies in the sampling data. f (χ 2 ) 1 1 (χ2 )( υ /2 [(υ/2 1)]! 2υ /2 1) e χ2 /2 9(xvii) Figure 9.7 which is a 3 4 contingency table as there are three rows and four columns. The row totals are given by TR1 through TR3 and the column totals by TC1 through TC4. υ. of respectively 4. The mode or the peak of the curve is equal to the degrees of freedom less two. fo. United States 227 93 320 Germany 213 102 315 Italy 158 97 255 England 218 92 310 Total 816 384 1. The sample data collected using an employee questionnaire is in Table 9. In order to test whether preference for a certain work schedule depends on the location. υ. we test using a chisquare distribution. the peak of each curve is for values of χ2 equal to 2. 4-day/week work schedule. This is a 2 4 contingency table as there are two rows and four columns. in the column designated . For small values of υ the curves are positively or right skewed. or there is simply no dependency. nor the column totals are considered in determining the dimension of the table. Assume that a large multinational company samples its employees in the United States. f (χ2) where this probability density function is given by. For example. for the three curves illustrated. respectively. the values on the x-axis are always positive and extend from zero to infinity. Germany. (Number of rows 1) * (Number of columns 1) 9(xviii) Chi-square distribution The chi-square distribution is a continuous probability distribution and like the Student-t distribution there is a different curve for each degree of freedom. fo. The y-axis is the Consider Table 9. 8. or χ to the power of two. The value of the row totals and the column totals are fixed and the “yes” or “no” in the cells indicate whether or not we have the freedom to choose a value in this cell. R1 through R3 indicate the rows and C1 through C4 indicates the columns. and England using a questionnaire to discover their preference towards the current 8-hour/day. As the value of υ increases the curve takes on a form similar to a normal distribution. In this contingency table. Neither the row totals.

In this table we have the freedom to choose only six values or the same as determined from equation 9(x).10 Chi-square distribution for three different degrees of freedom. Degrees of freedom (3 1) * (4 2*3 6 1) Chi-square distribution as a test of independence Going back to our cross-classification on work preferences in Table 9.7 Contingency table. 20% 18% 16% 14% 12% f(χ2) 10% 8% 6% 4% 2% 0% 0 2 4 6 8 10 12 14 χ2 df 4 df 8 df 12 16 18 20 22 24 26 28 Table 9.Chapter 9: Hypothesis testing for different populations 315 Figure 9. C1 C2 yes yes no TC2 C3 yes yes no TC3 C4 no no no TC4 Total rows TR1 TR2 TR3 TOTAL R1 R2 R3 Total columns yes yes no TC1 by C1 we have only the freedom to choose two values. pU is the proportion in the United States who prefer the present work schedule pG is the proportion in Germany who prefer the present work schedule . The same logic applies to the rows. the third value is automatically fixed by the total of that column.6 let as say that.

Another way of calculating the expected frequency is from the relationship. For example. Thus the complete expected data.00 Total 816.60 255.80 99. the sample size for the United States is 320 and so assuming the null hypothesis.60. f e.200 This is also saying that for the null hypothesis of the employee preference of work schedule is independent of the country of work.00 8 hours/day 10 hours/day Total pI is the proportion in the Italy who prefer the present work schedule pE is the proportion in England who prefer the present work schedule The null hypothesis H0 is that the population proportion favouring the current work schedule is not significantly different from country to country and thus we can write the null hypothesis situation as follows: H0: pU pG pI pE 9(xix) for the work schedule.00 384. fe.20 100.200 who prefer the is 384/1. on the assumption that the null hypothesis is correct is as in Table 9.40 320. United States 217.20 310.40 since the choice is one schedule or the other.6 if the null hypothesis is correct and that there is no difference in the preference TRo and TCo are the total values for the rows and columns for a particular observed frequency fo in a sample of size n.200.3200 who prefer the is 816/1. For example.60 102. then from the sample data. We then use these proportions on the sample data to estimate the population proportion that prefer the 8-hour/day or the 10-hour/day schedule.80 315.8 Preference Work preference – expected frequencies.00 England 210.3200 * 320 102.40. The estimated number that prefers the 10-hour/day schedule is 0. The alternative hypothesis is that population proportions are not the same and that the preference for the work schedule is dependent on the country of work.8. H1: pU pG pI pE 9(xx) Thus in hypothesis testing using the chi-square distribution we are trying to determine if the population proportions are independent or dependent according to a certain criterion. the alternative hypothesis H1 is written as. from Table 9. the chisquare test is also known as a test of independence. These are then considered expected frequencies. .6 let us consider the cell that gives the observed frequency for Germany for a preference of an 8-hour/day schedule. This value is also given by 320 217.6800 Population proportion 10-hour/day schedule 0.6800 * 320 217.00 Germany 214. in this case the country of employment. Thus. ● ● Population proportion 8-hour/day schedule 0.316 Statistics for Business Table 9.60 102. In this case.00 Italy 173. fe TRo * TCo n 9(xxi) Determining the value of chi-square From Table 9. This test determines frequency values as follows.00 1. the estimated number that prefers the 8-hour/day schedule is 0.40 81.

a significance level of 5% gives us a critical value of the chi-square value of 7.9064 0. fe 816 TCo 315 n 1.0067 1.16 51.40 210.40 7.80 102.60 99. Let us say for the work preference situation that we consider 5% significance.3325 TRo Thus.200. χ2. χ2 [function CHIDIST] This generates the area in the chi-distribution when you enter the chisquare value and the degrees of freedom of the contingency table.20 15.60 0.84 757.9 Work preference – observed and expected frequencies.9 gives the detailed calculations.6 we have two rows and four columns.40 100.20 0.40 1.0143 2. χ2 ∑ ( fo fe f e )2 9(xxii) where fo is the frequency of the observed data and fe is the frequency of the expected or theoretical data.20 1. Testing the chi-square hypothesis for work preference As for all hypothesis tests we have to decide on a significance level to test our assumption.3677 0. Excel and chi-square functions In Microsoft Excel there are three functions that are used for chi-square testing.00 (fo fe) 2 Total 88.84 88.9 the value of the sample chi-square as shown is.200 TRo * TCo n 816 * 315 1.20 9. [function CHIINV] This generates the chisquare value when you enter the area in the chi-square distribution and the degrees of freedom of the contingency table.2459 0.8629 0.5226 6.80 81.36 1. (fo fe )2 fe fo 227 213 158 218 93 102 97 92 1. Table 9. In Table 9. the total amount in the fo column must equal to total in the fe column and also the total ( fo fe ) must be equal to zero.Chapter 9: Hypothesis testing for different populations 317 Table 9.60 214.40 1.00 fo fe 9.44 237.36 1. is given by the relationship.3325 Note in order to verify that your calculations are correct. [function CHITEST] This generates the area in the chi-square distribution when you enter the observed frequency and the expected frequency values assuming the null hypothesis. for the chisquare test we also need the degrees of freedom.200 fe 217. thus the degrees of freedom for this table is. Degrees of freedom (2 1) * (4 1*3 3 1) ∑ ( fo fe f e )2 6.4061 0.20 The value of chi-square.20 15.8147.44 237.40 7. Using Excel [function CHIINV] for 3 degrees of freedom. .20 173. Thus from this information in Table 9.16 51. 200 214. In addition.

.65% is greater than 5.318 Statistics for Business Figure 9.8147 at the 5% significance level given. Figure 9.3325.8 This then gives the value 0. f (x 2) Total shaded area 10.3325.65% which is the area in the chisquare distribution for the observed data.11.12. Since 9.00% To left of sample x 2 9.2514 Critical value at 10% significance sample Using the p-value approach for the hypothesis test In the previous paragraph we indicated that if we use [function CHITEST] we obtain the value 9.3325 sample 7.0965 or 9.65% 6. We can avoid performing the calculations shown in Table 9. which is the area in the chi-square distribution.8147 Critical value at 5% significance 6. This concept is illustrated in Figure 9.3325 sample 7. 6.9 by using first from Excel [function CHITEST].6 and the expected frequency values fe as given in Table 9.8147 Critical value at 5% significance The positions of this critical value and the value of the sample or test chi-square value are shown in Figure 9. This is also the p-value for the observed data. is less than the critical value of 7.3325 6.65% 6.00% the significance level we accept the null hypothesis or the same conclusion as before. f (x 2) f (x 2) Area 9.65%. Since the value of the sample chi-square statistic. We then use [function CHIINV] and insert the value 9.65% and the degrees of freedom to give the sample chi-square value of 6.12 Chi-square distribution for work preferences. In this function we enter the observed frequency values fo as shown in Table 9.13 Chi-square distribution for work preferences.11 Chi-square distribution for work preferences. Figure 9. we accept the null hypothesis and say that there is no statistical evidence to conclude that the preference for the work schedule is significantly different from country to country.

In these cases we may not be interested in the mean values of one population but in the difference of the mean values of both populations. When we have small sample sizes our analytical approach is similar except that we use a pooled sampled variance and the Student-t distribution for our analytical tool.3325. ● ● Area under the chi-square distribution represented by the sampling data is 9. Alternatively. This new relationship is illustrated in Figure 9. Differences of the means between dependent or paired populations This hypothesis test of the differences between paired samples has the objective to see if there are measured benefits gained by the introduction of new programmes such as employee training to improve productivity or to increase sales. It also looks at hypothesis testing for the differences in the proportions of two populations. The last part of the chapter presented the chi-square test for examining the dependency of more than two populations.Chapter 9: Hypothesis testing for different populations 319 Changing the significance level For the work preference situation we made the hypothesis test at 5% significance.2514 (using chi-square values). etc. employee productivity in one country and another.00% (the p-value approach). if these are known. using estimates of the population standard deviation measured from the samples. 6. fitness programmes to reduce weight or increase stamina. We develop first a probability distribution of the difference in the sample means. or if they are not known.13. alternatively 9. In all cases we propose a null hypothesis H0 and an alternative hypothesis H1 and test to see if there is statistical evidence whether we should accept. We reject the null hypothesis and conclude that the country of employment has some bearing on the preference for a certain work schedule. From this we determine the standard deviation of the distribution by combining the standard deviation of each sample using either the population standard deviations.65%. the null hypothesis. In these type of hypothesis .65% 10. Sample chi-square value is 6.2514. From the sample test data we determine the sample z-value and compare this to the z-value dictated by the given significance level α. What if we increased the significance level to 10%? In this case nothing happens to our sampling data and we still have the following information that we have already generated. we can make the hypothesis test using the p-value approach and the conclusion will be the same. Chapter Summary Difference between the mean of two independent populations The difference between the mean of two independent populations is a test to see if there is a significant difference between the two population parameters such as the wages between men and women. etc. Now since. This chapter has dealt with extending hypothesis testing to the difference in the means of two independent populations and the difference in the means of two dependent or paired populations. or reject. coaching courses to increase student grades. Using [function CHIINV] for 10% significance and 3 degrees of freedom gives a chi-square value of 6. the grade point average of students in one class or another.3325 6.

The hypothesis test is based on the chi-square frequency distribution. Chi-square test for dependency The chi-square hypothesis test is used when there are more than two populations and tests whether data is dependent on some criterion. and the respective sample sizes. The first step is to develop a cross-classification table based on the sample data. The test procedure is to see whether the sample test value χ2 is greater or lesser than the critical value χ2. Alternatively we use the p-value approach and see whether the area under the curve determined from the sample data is greater or smaller than the significance level. This is (number of rows 1) * (number of columns 1). The test procedure is similar to the differences in means except rather than measuring the difference in numerical values we measure the differences in percentages. α. We then determine whether the sample z-value is greater or lesser than the critical z-value. which has a y-axis of frequency and a positive x-axis χ2 extending from zero.320 Statistics for Business test we are dealing with the same population in a before and after situation. The hypothesis test is then analogous to that for a single population. using the sample proportion of successes as our benchmark. For large samples we use a z-value for our critical test and a Student-t distribution for small sample sizes. fe. If we use the p-value approach we test to see whether the area in the tail or tails of the distribution is greater or smaller than the significance level α. In this case we measure the difference of the sample means and this becomes our new sampling distribution. There is a chi-square distribution for each degree of freedom of the cross-classification table. To perform the chisquare test we need to know the degrees of freedom of the cross-classification table of our sample data. . Assuming that the null hypothesis is correct we calculate an expected value of the frequency of occurrence. Difference between the proportions of two populations with large samples This hypothesis test is to see if there is a significant difference between the proportion or percentage of some criterion of two different populations. fo. We calculate the standard error of the difference between two proportions using a combination of data taken from the two samples based on the proportion of successes from each sample. the proportion of failures taken from each sample. This information gives the observed frequency of occurrence.

Confirm your conclusions to Question 4 using the p-value approach. Tee shirts Situation A European men’s clothing store wants to test if there was a difference in the price of a certain brand of tee shirts sold in its stores in Spain and Italy. does the data indicate that there is a significant difference in the price of tee shirts in the two countries? 3. at a 1% significance level.120. Using the critical value method. at a 2% significance level.70)2. Gasoline prices Situation A survey of 102 gasoline stations in France in January 2006 indicated that the average price of unleaded 95 octane gasoline was €1. Required 1. 6.4270 per litre with a standard deviation of €0.80)2. Confirm your conclusions to Question 4 using the p-value approach. Using the critical value method. Another sample survey taken 6 months later at 97 gasoline stations indicated that the average price of gasoline was €1. Using the critical value method would your conclusions change at a 5% significance level? 5. 4. Using the critical value method would your conclusions change at a 5% significance level? 5. 2. Confirm your conclusions to Question 2 using the p-value approach. 6.105.80 with a variance of (€2. Indicate an appropriate null and alternative hypotheses for this situation if we wanted to know if there is a significant difference in the price of gasoline. It took a sample of 41 stores in Spain and found that the average price of the tee shirts was €27. Required 1.90 with a variance of (€3. It took a sample of 49 stores in Italy and found that the average price of the tee shirts was €26. Indicate an appropriate null and alternative hypothesis for this situation if we wanted to test if the price of tee shirts is significantly greater in Spain than in Italy? . Indicate an appropriate null and alternative hypotheses for this situation if we wanted to know if there is a significant difference in the price of tee shirts in the two countries. 4.Chapter 9: Hypothesis testing for different populations 321 EXERCISE PROBLEMS 1. 2. Confirm your conclusions to Question 2 using the p-value approach. does this data indicate that there has been a significant increase in the price of gasoline in France? 3.381 per litre with a standard deviation of €0. What do you think explains these results? 2.

does this data indicate that those stores using FAX keep a higher level of inventory than those using Internet? 5. does this data indicate that those stores using FAX keep a higher level of inventory than those using Internet? 3. 2.322 Statistics for Business 7. This means that the store has on average 14 days supply of products to satisfy estimated sales until the next delivery arrives from the distribution centre. 6. Using the critical value method. does the data indicate that the price of tee shirts is greater in Spain than in Italy? 8. Confirm your conclusions to Question 4 using the p-value approach. In the database system when an order is taken from a customer . at a 1% significance level. Using the critical value method. Confirm your conclusions to Question 7 using the p-value criterion? 3. at a 1% significance level. For example. The headquarters of the chain collected the following sample data from 12 stores that used direct FAX and 13 that used Internet connections for the same non-perishable items in terms of the number of days’ coverage of inventory. 4. Using the critical value method. How might you explain the conclusions obtained from Questions 4 and 5? 4. Confirm your conclusions to Question 2 using the p-value approach. at a 5% significance level. Stores FAX Stores internet 14 12 11 8 13 14 14 11 15 6 11 3 15 15 17 8 16 7 14 22 22 19 16 3 4 Required 1. the first value for a store using FAX has a value of 14. Inventory levels Situation A large retail chain in the United Kingdom wanted to know if there was a significant difference between the level of inventory kept by its stores that are able to order on-line through the Internet with the distribution centre and those that must use FAX. Restaurant ordering Situation A large franchise restaurant operator in the United States wanted to know if there was a difference between the number of customers that could be served if the person taking the order used a database ordering system and those that used the standard handwritten order method. Indicate an appropriate null and alternative hypotheses for this situation if we wanted to show if those stores ordering by FAX kept a higher inventory level than those that used Internet.

It selects 11 of its key stores in the Birmingham and Coventry area and sends these sales staff progressively to a training programme in London. Confirm your conclusions to Question 2 using the p-value approach. The following sample data was taken from some of the many restaurants within the franchise of the average number of customers served per hour per waiter or waitress. Using the critical value method. Standard (S) Using database (D) 23 30 20 38 34 43 6 37 25 67 25 43 31 42 22 34 30 50 34 45 Required 1. techniques of how to spend more time on the high-revenue products. Sales revenues Situation A Spanish-based ladies clothing store with outlets in England is concerned about the low store sales revenues. What is an appropriate null and alternative hypotheses for this situation? 2. and generally how to improve team work within the store. The after data is based on a consecutive 3-month period after the training programme had been completed for all pilot stores. Using the same 1% significance level how could you rewrite the null and alternative hypothesis to show that the data indicates better the franchise operator’s belief? 5. Thus it takes additional time. . When orders are made by hand the waiter or waitress has to go to the kitchen and give the order to the chef. Confirm your conclusions to Question 5 using the p-value approach. 7. The firm decided that it would extend the training programme to its other stores in England if the training programme increased revenues by more than 10% of revenues in its pilot stores before the programme. 6. does the data support the belief of the franchise operator? 3.Chapter 9: Hypothesis testing for different populations 323 it is transmitted via the database system directly to the kitchen. What do you think are reasons that some of the franchise restaurants do not have a database ordering system? 5. at a 1% significance level. In an attempt to reverse this trend it decides to conduct a pilot programme to improve the sales training of its staff. Test your relationship in Question 4 using the critical value method. The table below gives the average monthly sales in £ ’000s before and after the training programme. The franchise operator believed that up to 25% more customers per hour could be served if the restaurants were equipped with a database ordering system. 4. This training programme includes how to improve customer contact. The before data is based on a consecutive 6-month period.

Verify your conclusion to Question 5 by using the p-value approach. Indicate the null and alternative hypotheses for this situation if we wanted to know if the training programme has reached its objective? 3. 4. It decides to see if improvements could be made by extensive advertising and reducing prices. What is the benchmark of sales revenues on which the hypothesis test programme is based? 2. 5. does it appear that the objectives of the training programme have been reached? 4. Using the critical value approach at a 15% significance level.324 Statistics for Business Store number Average sales before (£ ’000s) Average sales after (£ ’000s) 1 2 3 4 5 6 275 299 7 8 9 249 258 10 265 267 11 302 391 256 202 302 289 203 189 302 345 259 357 259 358 368 402 Required 1. does it appear that the objectives of the advertising programme have been reached? 3. Using the critical value approach at a 1% significance level. and a 3-month period after advertising for the same hotels. 6. 7. Using the critical value approach at a 1% significance level. Indicate the null and alternative hypotheses for this situation if we wanted to know if the advertising programme has reached an objective to increase the yield rate by more than 10%? 2. Should management be satisfied with the results obtained? . The data collected is given in the following table. It selects nine of its hotels and measures the average yield rate (rooms occupied/rooms available) in a 3-month period before the advertising. Using the critical value approach at a 5% significance level. Verify your conclusion to Question 3 by using the p-value approach. Hotel yield rate Situation A hotel chain is disturbed about the low yield rate of its hotels. What are your comments on this test programme? 6. does it appear that the training programme has reached its objective? 6. does it appear that the objectives of the advertising programme have been reached? 5. Verify your conclusion to Question 2 by using the p-value approach. Verify your conclusion to Question 4 by using the p-value approach. Hotel number Yield rate before (1) Yield rate after (2) 1 52% 72% 2 47% 66% 3 62% 75% 4 65% 78% 5 71% 77% 6 59% 82% 7 81% 89% 8 72% 79% 9 91% 96% Required 1.

Using the critical value approach at a 1% significance level. Hotel customers Situation A hotel chain was reviewing its 5-year strategic plan for hotel construction and in particular whether to include a fitness room in the new hotels that it was planning to build. 4. drinking too much coffee. This was then calculated into the average number per month. does it appear that eliminating coffee the objectives of the reduction in migraine headaches has been reached? 5. Then they were asked to stop drinking coffee for 3 months and record again the number of migraine attacks they experienced. There are medicines available but their efficiency is often questioned. Verify your conclusion to Question 2 by using the p-value approach. or consuming too much sugar. Verify your conclusion to Question 4 by using the p-value approach. A study was made on 10 volunteer patients who were known to be migraine sufferers. approximately what reduction in the average number of headaches has to be experienced before we can say that eliminating coffee is effective? 7. It had made a survey in 2001 on customers’ needs and in a questionnaire of 408 people surveyed.Chapter 9: Hypothesis testing for different populations 325 7. This again was reduced to a monthly basis. At a 1% significance level. The complete data is in the table below. Indicate the null and alternative hypothesis for this situation if we wanted to show that the complete elimination of coffee in a diet reduced the impact of migraine headaches by 50%. They begin with blurred vision either in one or both eyes and then are often followed by severe headaches. What are your comments about this experiment? 8. does it appear that eliminating coffee the objectives of the reduction in migraine headaches has been reached? 3. These patients were first asked to record over a 6-month period the number of migraine headaches they experienced. 192 said that they would prefer to make a reservation with a hotel that had a fitness room. 6. Using the critical value approach at a 10% significance level. A similar survey was made in 2006 and out of 397 persons who returned . Migraine headaches Situation Migraine headaches are not uncommon. Studies have indicated that migraine is caused by stress. 2. Patient Average number per month before (1) Average number per month after (2) 1 23 12 2 27 18 3 24 14 4 18 5 5 31 12 6 24 12 7 23 15 8 27 12 9 19 6 10 28 14 Required 1.

what are your conclusions if you test at a significance level of 1%? 8. does it appear that there is a significant difference in flight delays in 2005 and 1996? 5. Using the critical value approach at a 5% significance level. in a sample of 508 flights. 2. What has to be the significance level in order for your conclusions in Question 7 to be different? 9. Verify your conclusion to Question 2 by using the p-value approach. Using the critical value approach at a 1% significance level. does it appear that there is a significant difference between flight delays in 2005 and 1996? 3. 7. Verify your conclusion to Question 4 by using the p-value approach. Indicate an appropriate null and alternative hypotheses for this situation to respond to the question has there been a significant increase in flight delays between 1996 and 2005? 7. if the difference was greater than 20 minutes of the scheduled time. Required 1. 2. 6. In 1996 out of a sample of 456 flights. Flight delays Situation A study was made at major European airports to see if there had been a significant difference in flight delays in the 10-year period between 1996 and 2005. 242 were delayed. 5. Indicate the null and alternative hypotheses for this situation if we wanted to see if the customer needs for a fitness room in 2006 is greater than that in 2001. does it appear that customer needs in 2006 are greater than in 2001? 6. 4. Using the critical value approach at a 5% significance level. From the relationship in Question 6 and using the critical value approach. A flight was considered delayed. What are your comments about the results? 9. 4. Using the critical value approach at a 5% significance level. In 2005. Indicate an appropriate null and alternative hypotheses for this situation. does it appear that there is a significant difference between customer needs for a fitness room in 2006 than in 2001? 3. Required 1. either on takeoff or landing.326 Statistics for Business the questionnaire. 310 were delayed more than 20 minutes. What are your comments about the sample experiment? . 210 said that a hotel with a fitness room would influence booking decision. Verify your conclusion to Question 5 by using the p-value approach. Indicate an appropriate null and alternative hypothesis for this situation. Verify your conclusion to Question 2 by using the p-value approach.

In June 2006 it was in Germany. Required 1. moderate. and low. Indicate an appropriate null and alternative hypotheses for this situation to test whether people’s interest in the World Cup has changed between 1998 and 2006. In 2002 it was in Korea and Japan. Verify your conclusion to Question 2 by using the p-value approach. In 2006 out of a sample of 112 people taken in early June. Travel time and stress Situation A large company located in London observes that many of its staff are periodically absent from work or are very grouchy even when at the office. 6. As a result of these comments the human resource department of the firm sent out 200 questionnaires to its employees asking the simple question what is your commuting time to work and how do you feel your stress level on a scale of high. Travel time Less than 30 minutes 30 minutes to 1 hour Over 1 hour High stress level 16 23 27 Moderate stress level 12 21 25 Low stress level 19 31 12 . does it appear that there is a difference between people’s interest in the World Cup between 1998 and 2006? 3. Using the critical value approach at a 5% significance level. World Cup Situation The soccer World Cup tournament is held every 4 years. What are your comments about the sample experiment? 11. and in June 1998 it was in France. 10. Indicate an appropriate null and alternative hypotheses to test whether there has been a significant increase in interest in the World Cup between 1998 and 2006? 7.Chapter 9: Hypothesis testing for different populations 327 10. Verify your conclusion to Question 4 by using the p-value approach. or often late. Using the critical value approach at a 1% significance level. Casual remarks indicate that they are stressed by the travel time into the City as their trains are crowded. A survey was taken to see if people’s interest in the World Cup had changed in Europe between 1998 and 2006. 92 said that they were interested in the World Cup. 4. Confirm your conclusions to Question 7 using the p-value criterion. A random sample of 99 people was taken in Europe in early June 1998 and 67 said that they were interested in the World Cup. From the relationship in Question 6 and using the critical value approach. does it appear that there is a difference between people’s interest in the World Cup between 1998 and 2006? 5. what are your conclusions if you test at a significance level of 1%? 9. The table below summarizes the results that it received. 2.

Using the critical value approach of the chi-square test at a 5% significance level. The following information was collected by simple telephone contact on the number of people in those listed countries on whether or not they used the stock market as their investment strategy. Verify your conclusion to Question 1 by using the p-value approach of the chi-square test. Indicate the appropriate null hypothesis and alternative hypothesis for this situation if we wanted to test to see if stress level is dependent on travel time.328 Statistics for Business Required 1. Show the appropriate null hypothesis and alternative hypothesis for this situation if we wanted to test if there is a dependency between savings strategy and country of residence. 4. Corroborate your conclusion to Question 4 by using the p-value approach of the chi-square test. does it appear that there is a relationship between investing in stocks and the country of residence? 3. Would you say based on the returns received that the analysis is a good representation of the conditions at the firm? 7. Verify your conclusion to Question 2 by using the p-value approach of the chi-square test. Savings strategy Invest in stocks Do not invest in stocks United States 206 128 Germany 121 118 Italy 147 143 England 151 141 Required 1. does it appear that there is a relationship between stress level and travel time? 3. does it appear that there is a relationship between stress level and travel time? 5. 2. This information would be useful as to increase the firm’s presence in countries other than the United States. Investing in stocks Situation A financial investment firm wishes to know if there is a relationship between the country of residence and an individual’s saving strategy regarding whether or not they invest in stocks. 2. Using the critical value approach of the chi-square test at a 1% significance level. . What additional factors need to be considered when we are analysing stress (a much overused word today!)? 12. 6. Using the critical value approach of the chi-square test at a 1% significance level.

What are your comments about the results? 14. Using the critical value approach of the chi-square test at a 3% significance level. Verify your conclusion to Question 3 by using the p-value approach of the chi-square test. The sample information obtained is in the table below. Automobile preference Situation A market research firm in Europe made a survey to see if there was any correlation between a person’s nationality and their preference in the make of automobile they purchase. A survey was made in Italy.Chapter 9: Hypothesis testing for different populations 329 4. . 6. 4. Verify your results to Question 2 by using the p-value approach of the chi-square test. does it appear that there is a relationship between automobile purchase and nationality? 3. What has to be the significance level in order that there appears a breakeven situation between a dependency of nationality and automobile preference? 5. and France and the sample information obtained is given in the table below. Using the critical value approach of the chi-square test at a 1% significance level. Spain. Indicate the appropriate null and alternative hypotheses to test if the make of automobile purchased is dependent on an individual’s nationality. Newspaper reading Situation A cooperation of newspaper publishers in Europe wanted to see if there was a relationship between salary levels and the reading of a morning newspaper. Germany. does it appear that there is a relationship between investing in stocks and the country of residence? 5. Germany Volkswagen Renault Peugeot Ford Fiat 44 27 22 37 25 France 27 32 33 16 15 England 26 24 22 37 30 Italy 19 17 24 25 31 Spain 48 32 27 36 19 Required 1. What are your observations from the sample data and what is a probable explanation? 13. 2.

4.000 5 62 52 22 Required 1. 6. does it appear that there is a relationship between reading a newspaper and salary? 5. Show the appropriate null hypothesis and alternative hypothesis for this situation if we wanted to test if there is a dependency between wine consumption and country of residence.000 3 65 47 19 €75. Before it makes any decision it wants to know if a particular country. 2. and thus the culture. Wine consumption Situation A South African producer is planning to increase its export of red wine.330 Statistics for Business Salary bracket Salary category Always read Sometimes Never read €16. What are your comments about the sample experiment? 15.000 to €100.000 4 65 47 19 €100. Using a market research firm it obtains the following sample information on the quantity of red wine consumed per day.000 to €50. does it appear that there is a relationship between reading a newspaper and salary? 3. Using the critical value approach of the chi-square test at a 5% significance level. Amount consumed Never drink One glass or less Between one and two More than two England 20 72 85 85 France 10 77 65 79 Italy 15 70 95 77 Sweden 8 62 95 85 United States 12 68 48 79 Required 1. Verify your results to Question 4 by using the p-value approach of the chi-square test. has any bearing on the amount of wine consumed. Indicate the appropriate null and alternative hypotheses to test if reading a newspaper is dependent on an individual’s salary.000 2 55 40 28 €50.000 to €75. Using the critical value approach of the chi-square test at a 10% significance level. .000 1 36 44 30 €16. Verify your results to Question 2 by using the p-value approach of the chi-square test.

487 49.175 41.550 50.717 38.805 56.231 52.934 55.579 44.941 54.086 47.149 54.829 39.514 47.758 65.534 49.268 60.108 48.278 43.581 51.946 40.398 39.305 49.663 43.094 47.135 41.359 51.460 46.358 63.583 45.113 56.182 59.970 42.812 52.319 53.123 53.037 38.468 43.100 44.202 36.317 50.870 37.586 45.787 43.862 50.961 54.676 37.142 46.518 47.847 39.608 39.346 46.169 51.824 52.048 43.461 48.847 45. What is the chi-square value for the significance level of Question 4? 6.147 47.700 47.467 60.279 51.797 40.002 47.445 48.578 38.757 50.105 41.549 44.181 46.261 51. 5.263 56.373 50.082 52.978 40.738 55.175 46.600 52.465 52.921 Germany 45.665 50.056 51.382 50.555 50.767 40.829 44.397 56. Based on your understanding of business.615 42.716 40.276 41.125 53.911 43.386 62.070 50.189 52.621 43.391 54.164 51.100 50.527 48.759 54.384 55.896 54.329 34.754 48.423 57.256 45. what is the trend in wine consumption today? 16. Case: Salaries in France and Germany Situation Business students in Europe wish to know if there is a difference between salaries offered in France and those offered in Germany.093 47.242 49.961 41.821 47.369 51.872 49.438 54.283 51.746 49.294 61.536 53.208 51.807 46.001 52.574 37.024 46.013 47.453 53.790 38.334 53.939 48.681 46.498 47.795 46.066 49.557 41.592 48.904 49.920 53.171 49.151 44.065 . what has to be the minimum significance level in order to change the conclusion to Question 1? This is the p-value.878 52.892 48.270 52. Verify your conclusion to Question 2 by using the p-value approach.976 43.349 48.990 41.555 56.156 59.Chapter 9: Hypothesis testing for different populations 331 2.415 56.703 52.045 51.469 43.406 54.922 41.728 44.118 56.499 38.379 44.240 54.402 46.055 47.502 49.466 52.548 44.671 44.133 53.330 37.353 43.566 47.343 59.775 46. Using the critical value approach of the chi-square test at a 1% significance level.307 47.704 50.684 45.349 41.030 50. To the nearest whole number.892 41.786 49.125 51.382 39.128 49.006 55.570 43.863 41.545 57.292 42.694 41.628 49. An analysis was made by taking random samples from alumni groups in the 24–26 age group.609 55.129 47.799 45.800 52.379 46.457 42.271 52.303 36.628 50.863 49.889 52.533 50. France 52.852 55.137 42.513 44.566 45.179 50.787 45.161 38.870 44.668 48.953 42.131 51.833 43.336 49.658 43.578 36.457 36.235 47.326 36.986 40.864 43.838 40.589 52.493 36.847 52.491 48. does it appear that there is a relationship between wine consumption and the country of residence? 3.836 54. This information is given in the table below.198 44.091 63.584 42.481 60.568 44.975 44.806 44.924 46.937 51.683 45. 4.465 48.684 53.263 53.116 43.451 52.018 39.923 42.161 43.559 47.214 45.134 38.653 46.370 49.909 38.371 44.896 32.841 50.

613 56.184 41.856 43.115 40.585 42.893 48.148 42.800 Required 1.115 50.424 55.051 40.262 47.222 41.342 54.672 41.451 43.060 44.128 42.751 47.266 54.947 53.044 42.185 50.880 58.799 46.056 52.694 47.171 50.988 46.530 38.980 47.651 52.308 43.426 49.322 51.735 50.467 46.265 55.184 60.323 46.322 46.053 50.050 49.146 41.190 45.687 49.546 58.159 38.856 58.064 48.713 45.507 55.263 48.861 41.809 51.448 37.507 41.428 43.380 44.050 59.621 43.665 55.244 49.348 50.669 44.732 35.507 59.128 48.018 43.165 40.152 52.881 55.533 54.420 48.020 48.409 44.999 51.342 47.793 55.278 63.138 33.354 38.989 52. Using all of the concepts developed from Chapters 1 to 9 how might you interpret and compare this data from the two countries? .928 57.323 42.822 53.012 40.672 49.559 44.586 33.278 53.926 44.884 56.240 43.231 53.979 57.338 52.916 48.938 44.866 59.914 52.335 53.504 42.332 Statistics for Business 52.012 53.671 45.440 55.462 46.812 52.755 54.369 59.

If we make an optimistic forecast – estimating more than actual. cash flow. which is a time series analysis for the amount of goods imported into the United States each year from 1996 to 2006. transportation volumes. Forecasts trigger strategic and operations planning. or insufficient storage space. unused storage space. In either case there is a cost.Forecasting and estimating from correlated data 10 Value of imported goods into the United States Forecasting customer demand is a key activity in business. If we are pessimistic in our forecast – estimating less than actual. www. An often used approach is to use historical or collected data as the basis for forecasting on the assumption that past information is the bellwether for future activity. we may have stockouts. we would say that there has been a reasonable linear growth in imported goods in the last decade from 1960. . Thus business must be accurate in forecasting. or unwanted personnel. 8 June 2007. hiring or termination of personnel.1. warehouse space. irritated or lost customers.gov/foreign-trade/statistics/historical goods. outsourcing requirements.census. and the like. we may be left with excess inventory.1 Consider for example that we are now in the year 1970. Forecasts are used to determine capital budgets. Foreign Trade division. raw material quantities. inventory levels. Then if we used a linear relationship for this 1 US Census Bureau. Consider the data in Figure 10. In this case.

Forecasting concepts are the essence of this chapter.1 Value of imported goods into the United States. . The actual value is $1. as the years progress. rather than using a linear relationship we should use a polynomial relationship on all the data or perhaps a linear regression relationship just for the period 2000–2005. we would arrive at a value of $131.000 800. and other emerging countries many of which are destined for Wal-Mart! Thus.800.334 Statistics for Business Figure 10.000 1. 2.000 1.000.000 0 60 65 70 75 80 85 90 95 00 05 19 19 19 19 19 19 19 19 20 20 20 10 Year period to forecast the value of imported goods for 2006. Quantitative forecasting methods are extremely useful statistical techniques but you must apply the appropriate model and understand the external environment.000 400.600.000 $millions 1.000 200.000.380 million or our forecast is low by an enormous factor of 14! As the data shows.861.200. India.050 million. there is an increasing or an almost exponential growth that is in part due to the growth of imported goods particularly from China.000 600.000 1.000 1. 1960–2006.400.

or wage levels are typically illustrated by a time series. or years A scatter diagram is the presentation of the time series data by dots on an x y graph to see if there is a correlation between the two variables. or independent variable. or costs can be presented in a time series. Operating data for example customer service level. Macro-economic data such as Gross National Product. months. such as revenues. capacity utilization of a tourist resort. If there is a reasonable correlation. Financial data such as revenues. or the measurement of the strength of a relationship between variables. Consumer Price Index.Chapter 10: Forecasting and estimating from correlated data 335 Learning objectives After you have studied this chapter you will understand how to correlate bivariate data and use regression analysis to make forecasts and estimates for business decisions. then regression analysis is a mathematical technique to develop an equation that describes the relationship between the variables in question. In a time series we are presenting one variable. The practical use of this part of statistical analysis is that correlation and regression can be used to forecast sales or to make other decisions when the developed relationship from past data can be considered to mimic future conditions. or quality levels can be similarly shown. time. The time. Scatter diagram A Time Series and Correlation A time series is past data presented in regular time intervals such as weeks. and this is called bivariate data. profits. to illustrate the movement of specified variables. These topics are covered as follows: ✔ ✔ ✔ ✔ ✔ ✔ ✔ A time series and correlation • Scatter diagram • Application of a scatter diagram and correlation: Sale of snowboards – Part I • Coding time series data • Coefficient of correlation • Coefficient of determination • How good is the correlation? Linear regression in a time series data • Linear regression line • Application of developing the regression line using Excel: Sale of snowboards – Part II • Application of forecasting or estimating using Microsoft Excel: Sale of snowboards – Part III • The variability of the estimate • Confidence in a forecast • Alternative approach to develop and verify the regression line Linear regression and causal forecasting • Application of causal forecasting: Surface area and house prices Forecasting using multiple regression • Multiple independent variables • Standard error of the estimate • Coefficient of multiple determination • Application example of multiple regression: Supermarket Forecasting using non-linear regression • Polynomial function • Exponential function Seasonal patterns in forecasting • Application of forecasting when a seasonal pattern exists: Soft drinks Considerations in statistical forecasting • Time horizons • Collected data • Coefficient of variation • Market changes • Models are dynamic • Model accuracy • Curvilinear or exponential models • Selecting the best model A useful part of statistical analysis is correlation. against another variable. is presented on .

2 Scatter diagram for the sale of snowboards. units y 60 90 110 320 250 525 400 800 1. the scatter diagram for the data from Table 10.200 1.800 2.600 2.100 2.800 1.600 1.1 Year x 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 Sales of snowboards.400 1. The variable on the y-axis is considered the dependent variable since it is “dependent”.000 800 600 400 200 0 19 89 19 90 19 91 19 92 19 93 19 94 19 95 19 96 19 97 19 98 19 99 20 00 20 01 20 02 20 03 20 04 20 05 20 06 Year .1.2.000 1. on the y-axis.400 Application of a scatter diagram and correlation: Sale of snowboards – Part I Consider the information in Table 10.600 1. Sales. which is a time series for the sales of snowboards in a sports shop in Italy since 1990.550 2. tomorrow will always come! Table10.200 Snowboards sold (units) 2.400 2.1 is shown in Figure 10.000 2. of the time.500 2. or the ordinate. or a function. We Figure 10. or a stock market crash. Time is always shown on the x-axis and considered the independent variable since whatever happens today – an earthquake.200 985 1.336 Statistics for Business the x-axis or abscissa and the variable of interest. a flood. 2. Using in Excel the graphical command XY(scatter).

2. Figure 10.1.000 1.800 2. With the snowboard sales data calculation is not a problem since the time in years is already numerical data. However. a 12-month period would appear coded as in Table 10. etc. 3 1992. or correlation. Thus for information. Table 10. between the sale of snowboards. 2 1991. (Note that in Appendix II you will find a guide of how to develop a scatter diagram in Excel. 2 1991.2.2 Month January February March April May June July August September October November December Codes for time series data. This is especially the case when the time is mixed alphanumeric data since it is not always convenient to perform calculations with such data.3 is identical to Figure 10.Chapter 10: Forecasting and estimating from correlated data can see that it appears there is a relationship.3 gives the scatter diagram using a coded value for x where 1 1990.3 Scatter diagram for the sale of snowboards using coded values for x.200 1. the x-values are large and these can be cumbersome in subsequent calculations. etc. For example.400 2.600 1. Code 1 2 3 4 5 6 7 8 9 10 11 12 Figure 10.) 337 Coding time series data Very often in presenting time series data we indicate the time period by using numerical codes starting from the number 1. The form of this scatter diagram in Figure 10. and the year in that sales are increasing over time.200 Snowboards sold (units) 2.) 12 13 14 15 16 17 .600 2.800 1.400 1.000 800 600 400 200 0 0 1 2 3 4 5 6 7 8 9 10 11 Year (1 1990. rather than the actual period.

057. r. it means that for the range of data given the variable y decreases with x.400 203. x (coded) 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 136 y 60 90 110 320 250 525 400 800 1.600 8.402.280 1. This is close to 1.0 and thus it indicates there is a strong correlation between x and y.3 using a coded value for the time period rather than using the numerical values of the year.100 954.800 6.936 18.000 6. If r is positive.200 2.800 9.912.000 2.250. Coefficient of determination The coefficient of determination.500 275. r2.9316 .200 985 1. However it is not necessary to go through this complicated procedure as the coefficient of correlation can be determined by using [function CORREL] in Excel.496 y2 3.440 179.500 4. If r is negative. a next step is to determine the strength or the importance of the relationship between the time or independent variable x.000 1.400 10. It does not matter which. The value of r is either plus or minus and can take on any value between 0 and 1.625 160.560.410. When Table 10.000 970. In the case of the snowboard sales given in the example.000 640.500 38.225 2.3 x (year) 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 Total r approaches zero it means that there is a very weak relationship between x and y.000 35.890 xy 60 180 330 1.000 29.400 62. The calculation steps using equation 10(i) are given in Table 10. The closer the value of r is to unity.500 2.150 2. One measure is the coefficient of correlation. r 0.251.600 26.496 464.000.440. is another measure of the strength of the relationship Coefficients of correlation and determination for snowboards using coded values of x.800 285.297.100 2.600 18.160 5. as the result is the same. r ⎡ ⎢ n ∑ x2 ⎢⎣ n ∑ xy (∑ x) 2⎤ ⎡ ∑x∑ y ⎥ ⎢ n ∑ y2 ⎥⎦ ⎢⎣ (∑ y) ⎥ ⎥⎦ 10(i) 2⎤ Here n is the number of bivariate (x.000 2.600 1.000 5.200 x2 1 4 9 16 25 36 49 64 81 100 121 144 169 196 225 256 1.700 0.000 4. which is defined by the rather horrendous-looking equation as follows: Coefficient of correlation.250 3. y) values.850 17.100 12. and the dependent variable y.760.9652.9652 0. the stronger is the relationship between the variables x and y.050 n n Σxy Σx Σy n Σx2 (Σx)2 n Σy2 (Σy)2 n Σxy Σx Σy n Σx2 (Σx)2 n Σy2 (Σy)2 r r2 16 3.338 Statistics for Business Coefficient of correlation Once we have developed a scatter diagram. it means that y increases with x. In Excel [function PEARSON] can also be used to determine the coefficient of correlation.640.550 2.040 23.400 16.000 31. You simply enter the corresponding values for x and y where x can either be the indicated period (provided it is in numerical form) or the code value.272.100 102.

b is a constant value and equal to the slope of the regression line. The equation for the coefficient of determination is as follows: Coefficient of correlation. Further.3. the coefficient of determination. For the snowboard sales the value of r2 is 0. the coefficient of determination always has a positive value. How good is the correlation? Analysts vary on what is considered a good correlation between bivariate data.8) 0. is reasonably strong. is always equal to or less than r.9316. 339 Linear regression line The linear regression line is the best straight line that minimizes the error between the data points on the regression line and the corresponding actual data from which the regression line is developed. When r 1. of the actual dependent variable.Chapter 10: Forecasting and estimating from correlated data between x and y.8944). x. then we can develop a linear regression equation to define this relationship. y. then r2 1. ● ( n ∑ xy ∑ x ∑ y ) 2⎤ ⎡ 2⎤ ⎡ ⎢ n ∑ x2 ( ∑ x ) ⎥ ⎢ n ∑ y2 ( ∑ y ) ⎥ ⎢ ⎥⎢ ⎥ 2 a bx 10(iii) ⎣ ⎦⎣ ⎦ 10(ii) ● Again we can obtain the coefficient of determination directly from Excel by using [function RSQ]. ● ● a is a constant value and equal to the intercept on the y-axis. numerically the value of r2. I say that if you have a value of r2 of at least 0. r2 we can subsequently use this equation to forecast beyond the time period given. Again for completeness. the calculation using equation 10(ii) is shown in Table 10.0. r. Since it is the square of the coefficient of correlation. where r can be either negative or positive. and the aver– using the two equations age value of y or y below. or less than 1. since r is always equal to.0 which means that there is a perfect correlation between x and y. or forecast value. the coefficient of correlation. It does not matter which we use as the result is the same: b ∑ xy ∑ x2 y bx nx y n(x )2 10(vi) a 10(vii) . and the independent time variable. Another approach is to calculate b and a – using the average value of x or x . After that. The values of the constants a and b can be calculated by the least squares method using the following two relationships: a ∑ x2 ∑ y ∑ x ∑ xy 2 n ∑ x2 ( ∑ x ) n ∑ xy n ∑ x2 10(iv) b ∑x∑y 2 (∑ x) 10(v) Linear Regression in a Time Series Data Once we have developed a scatter diagram for a time series data. which means a value of r of about 0.9 (actually (0. ˆ y is the predicted.0. x is the time and the independent variable value. and the strength of the relationship between the dependent variable. then there is a reasonable relationship between the independent variable and the dependent variable.8. The following equation represents the regression line: ˆ y Here. y.

600 1.200 x2 1 4 9 16 25 36 49 64 81 100 121 144 169 196 225 256 1.410.890 1.4 x (year) 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 Total Average Regression constants for snowboards using coded value of x.000 2. ● ● ● Select Type Select Linear Select Options and check Display equation on chart and Display R-squared value on chart Application of developing the regression line using Excel: Sale of snowboards – Part II Once we have the scatter diagram for the bivariate data we can use Microsoft Excel to develop the regression line.6250 175.250.000 35.550 2. This is the Microsoft Excel format.100 2.000 970.3971 435.280 1.400 10.600 18.055.400 62.600 8.000 2.100 102.150 2.890 1.402.000 5.3971 8.250 3.000 6.2500 – x – y b using equation 10(vi) a using equation 10(vii) The calculations using these four equations are given in Table 10.496 y2 3.800 6.050 n Σx Σy Σx2 Σxy n Σx2 (Σx)2 n Σxy a using equation 10(iv) b using equation 10(v) 16 136 16. To do this we first select the data points on the scatter diagram and then proceed as follows: ● ● This final window is shown in Figure E-7 of the Appendix II.000 1.400 16.000.000 31. On the graph we have the regression line written as follows.936 18.500 4.4 for the snowboard sales using the coded values for x.251. However.200 435. again it is not necessary to perform these calculations because all the relationships can be developed from Microsoft Excel as explained in the next section.2500 In the form of equation 10(iii) it would be reversed and written as. which is a different form as represented by equation 10(iii).496 203.3971x In the menu select of Excel.4.560.5000 1.500 2.100 12.200 985 1.200 23.340 Statistics for Business Table 10.225 2.000 640.850 17.800 9.2500 175.3971x 435. x (coded) 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 136 8.625 160.760.6250 xy 60 180 330 1.500 275.5000 y 60 90 110 320 250 525 400 800 1.2500 175. The regression line using the coded values of x is shown in Figure 10. select Chart Select Add trend line .000 29.440. y 175.400 203.000 4. ˆ y 435.500 38.496 3.057.055.600 26.

assume that we want to forecast the sale of . The intercept.200 1.2500. When the value of a is negative. 2.25 units which has no meaning for this situation. Application of forecasting. b.000 800 600 400 200 0 0 1 2 3 4 5 6 7 8 9 Year 10 11 12 13 14 15 16 17 y 175.5.3971x. This is because the values of x are the real values and not coded values.000 1. The regression line using the actual values of the year is shown in Figure 10. the intercept on the y-axis is 435.800 2.4. avoid starting an equation with a negative value. That is.4 Regression line for the sale of snowboards using coded value of x.3971x 435.3971x 435.Chapter 10: Forecasting and estimating from correlated data 341 Figure 10.2500 rather than ˆ y 435. means that when x is zero the sales are 432. same where y is y and the slope of the line. 0. but the slope of the curve is positive.600 1. which appears on the graph. a.600 2. the regression information is the ˆ.2500 175. it is normal to show the above equations for this example in the ˆ form y 175. These numbers are the same as calculated and presented in Figure 10.3971 (say about 175 units) per year. then we can forecast or estimate a future value at a given date using in Excel [function FORECAST].9316 However.5 is the value of the intercept a. For example.800 1.400 1. using Microsoft Excel: Sale of snowboards – Part III If we are satisfied that there is a reasonable linear relationship between x and y as evidenced by the scatter diagram.9316.200 Snowboards sold (units) 2. is 175.3971 and. The only difference from Figure 10. a.2500 R 2 0. The slope of the line means that the sale of snowboards increases by 175.400 2. is the same as previously calculated though note that Microsoft Excel uses upper case R2 rather than the lower case r2. or estimating. The coefficient of determination.

x . n. for each random variable x. the less accurate will be the forecast.248 units. a forecast of sales for next year may be reasonably reliable. the assumption is that the pattern of past years will be repeated in future years.400 2. 2.200 1.000 1. s s2 ∑ (x (n x )2 1) 2(viii) The standard deviation is a measure of the – variability around the sample mean. which in this case is 21. we must use a code value for the year 2010.9316 Year snowboards for 2010.342 Statistics for Business Figure 10. Sample standard deviation. – about the mean value x is zero (equation 2(ix)). ∑ (x x) 0 2(ix) . The variability of the estimate In Chapter 2. we can use the coded values of x that appear in the 2nd column of Table 10.300. x. (Year 2005 has a code of 16. Further. We enter into the function menu the x-value of 2010 and the given values of x in years and the given value of y from Table 10. the further out we go in time. whereas a forecast 20 years from now would not. of data by the equation. s.3971x 349.800 1.2 and the corresponding actual data for y.600 1. Also. which may not necessarily be the case.800 2. For example. This gives a forecast value of y of 3. we presented the sample standard deviation.5 Regression line for the sale of snowboards using actual year.200 Snowboards sold (units) 2.1.600 2. thus year 2010 16 5 21.400 1. the deviation of all the observations. in a given sample size. or.000 800 600 400 200 0 89 90 91 92 93 94 95 96 97 98 99 00 01 02 03 04 05 20 19 19 19 19 19 19 19 19 19 19 19 20 20 20 20 20 20 06 y 175. If we do this.0000 R 2 0.) Note that in any forecasting using time series data. Alternatively.

We also have the degrees of freedom.803.8031 234.5 Statistics for the regression line.9316 190. and the coefficient of determination. the slope of the line. given by.3971 12.740. and the predicted values.7000 435. Like the standard deviation. y balance out when all data are considered. Again. The regression equation. The denominator in this equation is (n 2) or the number of degrees of freedom. If this is the case. the closer to zero is the value of the standard error then there is less scatter or deviation around the regression line. se. a and b. b. analogous to equation 2(ix) this means that. To do this we select a cellblock of two columns by five rows and enter the given x. r2.7380 10. we do not have to go through a stepwise calculation but the standard error of the estimate. The explanations are given to the left and to the right of each column. can be determined by using in Excel [function LINEST]. the value a.6029 se standard error of estimate degrees of freedom (n 2) In a similar manner. Thus. is determined so that the vertical distance between the observed.-8 ). In equation 10(viii) two degrees of freedom are lost because two statistics. or (n 2).2500 122.5. ˆ y zse 10(x) ∑ (y ˆ y) 0 10(ix) . this translates into saying that the linear regression model is a good fit of the observed data. and we should have reasonable confidence in the estimate or forecast made. a measure of the variability around the regression line is the standard error of the estimate.1764 14 767. we can determine the confidence limits of a forecast.Chapter 10: Forecasting and estimating from correlated data 343 Table 10. the intercept on the y-axis. Confidence in a forecast In a similar manner to confidence limits in estimating presented in Chapter 7.459. The other statistics are not used here but their meaning in the appropriate format is indicated in Table E-3 of Appendix II. The value of se has the same units of the dependant variable y.1471 a. or data values. Standard error of the estimate. intercept on the y-axis r2. Like the frequency distribution we execute this function by pressing simultaneously on control-shift-enter (Ctrl. If we have a sample size greater than 30 then the confidence intervals are given by.and y-values and input 1 both times for the constant data. q The statistics for the regression line for the snowboard data are given in Table 10. coefficient of determination 0. rather than (n 1) in equation 2(viii). y. slope of the line 175. together with other statistical information. y). ˆ. se ∑ (y n ˆ y)2 2 10(viii) Here n is the number of bivariate data points (x. are used in regression to compute the standard error of the estimate. Those that we have discussed so far are highlighted and also note in this matrix that we have again the value of b.

51 1.1764 3. where the degrees of freedom are given in Table 10.248 Upper limit is 3.494.71 28.7613.434.020.7613 * 234.74 617.37 154.91 2.72 1.46 19.46 90. using the total value of ˆ) (y y 2 in Column 6.318.600 1.361 Thus to better define our forecast we could say that our best estimate of snowboard sales in 1.669. To obtain a confidence level.36 230. se.34 441. For our snowboard sales situation we have a forecast of 3.844. The total of (y y in Column 5 verifies equation 10(ix).90 30.85 84.079. we can use these values to develop the specific values of the regression points and further to verify the standard error of the estimate.53 967.72 105.88 119.94 266.15 234.369.879.100 2.12 1.740.66 191.199.85 363.159.58 36.500 2.344 Statistics for Business Table 10.283.13 392.762. Code 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 Total se n x 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 y 60 90 110 320 250 525 400 800 1. and inserting this in equation 10(viii) verifies the value of the standard error of the estimate of Table 10. The colˆ umn y is calculated using equation 10(iii) and imputing the constant values of a and b from Table ˆ) 10.052.43 11.24 2. the last column of Table 10.04 767.07 14.62 9.305.6.248 1.51 155.69 95. . For a 90% confidence limit.7613 * 234.488.68 333.371.10 y ^ y (y ^ y )2 319.18 16 With sample sizes no more than 30.5.00 102.211.550 2.143. The calculation steps are shown in Table 10.5.06 53. ˆ y tse 10(xi) 2010 is 3.361 units.836 and 3.34 28.103.71 2.212.13 792.90 0.93 1. we use a Student-t relationship since we have a sample size of 16.1764 2. And.93 56.200 985 1.62 835.5 the confidence limits are as follows: Lower limit is 3.195.000 2.5.42 8.32 1.5.248 units and that we are 90% confident that the sales will be between 2.6. Then using equation 10(xi) and the standard error from Table 10.6 Calculating the standard estimate of the regression line using coded values of x.85 174.53 167. we use a Student-t relationship and the confidence limits are.76 24.836 Alternative approach to develop and verify the regression line Now that we have determined the statistical values for the regression line.22 111.31 2. as presented in Table 10.30 3.09 479.74 92. the value of t is 1.400 ^ y 259. using [function TINV].248 units for 2010.

a real estate agent has recorded the past sale of houses according to sales price and the square metres of living space.7 Surface area and house prices.240 140.6133x Coefficient of determination. Another type of correlation is when one variable is dependent.8. the sale of household appliances is in part a function of the sale of new homes. 2.250 280. is caused by the change of the dependent variable. Compute the coefficient of correlation.250 182. Price (€) y 260. or for many products. and the square metres of living space? Here this is a causal relationship where the price of the house is a function.8623 0.6 gives the scatter diagram for this causal relationship.250 690. For example.8623 Coefficient of correlation. 1.500 680. Table 10. Show the regression line and the coefficient of determination on the scatter diagram.Chapter 10: Forecasting and estimating from correlated data 345 Linear Regression and Causal Forecasting In the previous sections we discussed correlation and how a dependent variable changed with time.9048 11. r r2 0. Visually it appears that within the range of the data given. or a function. What can you say about the coefficients of determination and correlation? What is the slope of the regression line and how is it interpreted? The regression line is shown in Figure 10.7. the house prices generally increase linearly with square metres of living space.652.500 5.252.9286 Since the coefficient of determination is greater than 0. Does there appear to be a reasonable correlation between the price of homes.9 we can say that there is quite a strong correlation .263.749. r2 0. y.125 Square metres. Thus. In these situations we say that the movement of the dependent variable.500 1. the square metres is the independent variable. and thus the coefficient of correlation is greater than 0. the quantity sold is a function of price.000 921. Develop a scatter diagram for this information.250 3.000 425.7 together with the coefficient of determination. on some other variable. or is “caused” by the square metres of living space.000 760.280.000 600. x and the correlation can be used for causal forecasting or estimating. the demand for medical services increases with an aging population. The relationships are as follows: Regression equation.945. not of time. x. The analytical approach is very similar to linear regression for a time series except that time is replaced by another variable. and the house price is the dependent variable y.646.825. ˆ y 1.000 3. The following example illustrates this. Figure 10.000 2. x 100 180 190 250 360 200 195 110 120 370 280 450 425 390 60 125 Application of causal forecasting: Surface area and house prices In a certain community in Southern France. Using the same approach as for the previous snowboard example in a time series analysis.200. This information is in Table 10.500 2.

000.263.749.000.7 Regression line for surface area and house prices.6 Scatter diagram for surface area and house prices.000. 6.000 0 0 50 100 150 200 250 Area (m 2) 300 350 400 450 500 Figure 10.000 Price (€) 11.000.000.000 1.000.000. 6.000.9048 R 2 0.000.6133x 1.000.000 4.000 1.000 5.000 5.000 2.000 y 4.346 Statistics for Business Figure 10.8623 3.000.000.000 2.000 Price (€) 3.646.000 0 0 50 100 150 200 250 Area (m2) 300 350 400 450 500 .

263.5231.244.418. €2.6133 (say 11. The danger with making this estimate is that 800 m2 is outside of the limits of our observed data (it ranges from 60 to 450 m2).463 Upper limit is. If a house on the market has a living space of 310 m2. Using [function TINV] in Excel. Lower limit of price estimate using the standard error of the estimate from Table 10.444 €3.8623 87.749. Thus you must be careful in using causal forecasting beyond the range of data collected. Using equation 10(xi).274.650 within the range of the data given.346. Thus the assumption that the linear regression equation is still valid for a living space area of 800 m2 may be erroneous.346. intercept on the y-axis r2.442. Using in Excel [function LINEST] we have in Table 10. slope of the line 11. what is a reasonable estimate for the sales price of this house? What are your comments about this figure? Using in Excel [function FORECAST] for a square metre of living space.6133 1. ˆ y tse Thus we could say that a reasonable estimate of the price of a house with 310 m2 living space is €2.700.700 and that we are 85% confident that the price lies in the range €1. where the degrees of freedom are given in Table 10.940).1999 * 1012 a. b.418.6482 3.938 (say €3. the price of the house increases by about €11.0004 14 5. x of 800 m2 gives an estimated price (rounded) of €8.274.700 1.8 is. €2.646.5231 * 609. this means to say that for every square metre in living space. 4. Using in Excel [function FORECAST] for a square metre of living space.444 €1.9048 332.346.8 Regression statistics for surface area and house prices.8.772.3383 609. x of 310 m2 gives an estimated price (rounded) of €2. If a house was on the market and had a living space of 800 m2.646.346.8 the statistics for the regression line.2554 * 1013 se standard error of estimate degrees of freedom (n 2) between house prices and square metres of living space.700 1.460) to €3.0223 1.274.418.650). coefficient of determination 0.5231 * 609. what would be a reasonable estimate of the price? Give the 85% confidence intervals for this price. the value of t for a confidence level of 85% is 1.Chapter 10: Forecasting and estimating from correlated data 347 Table 10.541.053. The slope of the regression line is 11.938 Forecasting Using Multiple Regression In the previous section on causal forecasting we considered the relationship between just one dependent variable and one independent variable.463 (say €1. . 3.

Standard error of the estimate As for linear regression. and there are four independent variables then the degrees of freedom are 16 4 1 11. number of branch offices. b2. b3 and bk are constants and slopes of the line corresponding to x1. b1. the dependent variable. is a function of the quantity we eat and the amount of exercise we do. Coefficient of multiple determination Similar to linear regression there is a coefficient of multiple determination r2 that measures the strength of the relationship between all the independent variables and the dependent variable. ˆ y is the corresponding predicted value of dependant variable from the regression equation. x2. and possibly the more uncertain is the predicted value. the forecast estimate is a causal regression equation containing several independent variables. ● ● ∑ (y n ˆ y)2 k 1 10(xiii) ● ● y is the actual value of the dependant variable. in people. In this situation. and levels of alcohol in the blood. unit prices. x1. sales revenues can be a function of advertising expenditures. obesity. As before. number of sales staff. Automobile accidents are a function of driving speed. these values of the degrees of freedom are automatically determined in Excel when you use [function LINEST]. k. The calculation of this is illustrated in the following worked example. n is the number of bivariate data points. there is a standard error of the estimate se that measures the Application example of multiple regression: Supermarket A distributor of Nestlé coffee to supermarkets in Scandinavia visits the stores periodically to . the number of independent variables. with the same 16 bivariate data values. etc. the smaller the value of the standard error of the estimate. In business. Again. This is similar to equation 10(viii) for linear regression except that there is now a term k in the denominator where the value (n k 1) is the degrees of freedom. Also note that the more the number of independent variables in the relationship then the more complex is the model. and so the degrees of freedom are 14 1 1 12 or the denominator as given by equation 10(xiii). Multiple independent variables The following is the equation that describes the multiple regression model: ˆ y Here. x3. the better is the fit of the regression equation. degree of dispersion around the multiple regression plane. x2. ˆ y is the forecast or predicted value given by the best fit for the actual data. Since there are more than two variables in the equation we cannot represent this function on a two-dimensional graph. and xk are the independent variables. number of competing products on the market. k is a value equal to the number of independent variables in the model. is 1. For example. k is the number of independent variables. and xk. In linear regression. It is as follows: se Here. road conditions. ● a b1x1 b2x2 b3x3 … bkxk 10(xii) ● ● ● ● a is a constant and the intercept on the y-plane. As an illustration. if the number of bivariate data points n is 16. x3.348 Statistics for Business Multiple regression takes into account the relationship of a dependent variable with more than one independent variable.

827.997.827.487 76. the intercept on the y-plane 14. are indicated in Table E-4 of Appendix II.938.861.82 1.125 now select a virgin area of three rows and five columns and we enter two columns for the independent variable x.9 develop a two-independent variable multiple regression model for the unit sales per month as a function of the visits per month and the allotted shelf space. Shelf space.92 2. or model. The output from using this function is in Table 10. The difference is that we Table 10.827.85 .813 33.92 2. we can use again from Excel [function LINEST].975 74. the other statistics in the non-shaded areas are not used here but their meaning. 0.01.64.01x1 9. x2 9.9095 35.10 Regression statistics for sales of Nestlé coffee–two variables. Visits/month.9 regarding the unit sales for a particular size of instant coffee. ● Degrees of freedom.997.67.50 2. the number of visits made to that store. ● b1.Chapter 10: Forecasting and estimating from correlated data meet the store manager to negotiate shelf space and to discuss pricing and other sales-related activities.227.250 63. the standard error of the estimate 5. ● b2. df 7.51.50 1. The equation.75 2.51 df 7 #N/A #N/A #N/A 2. and the total shelf space that was allotted.35 349 Unit sales/month.64 4.64x2 As the coefficient of determination. the visits per month 4.81 a 14.750 39. in the appropriate format. visits per month and the shelf space.67 4.23 r 2 b1 4.01 1. the slope corresponding to the shelf space. From the information in Table 10.16 se 5.750 71. Determine the coefficient of determination.481. Again.425 55. x1 (m2) x2 9 4 6 5 3 6 7 6 8 2 3. b2 9. that describes this relation is from equation 10(xii) for two independent variables: ˆ y ˆ y a b1x1 b2x2 14.10.35 1.67 6. Table 10.055. As for times series linear regression and causal forecasting. The statistics that we need from this table are in the shaded cells and are as follows: ● a.313 71.8 the strength of the relationship is quite good. r2 0.086.938.227.150 58.75 246.997. is greater than 0. the slope corresponding to x1.9095. For one particular store the distributor had gathered the data in Table 10.9 Sales of Nestlé coffee. y 90.663. 1.568. ● se. ● Coefficient of determination.9095.83 0.82 1.32 1.227.537.480.

28 1. x1 9 4 6 5 3 6 7 6 8 2 Shelf space (m2). where the degrees of freedom are 7 as given in Table 10.50 1. b1.10 are.425 55.938.82 1.05.32 1. y 90. 82. the shelf space.984. Estimate the monthly unit sales if eight visits per month were made to the supermarket and the allotted shelf space was 3.12.227.51 from Table 10. .6166 * 5. and x2 is the shelf space of 3.350 Statistics for Business Table 10.35 Price (€/unit). The confidence limits of sales using the standard error of the estimate of 5. and price.6166. 3.60 2.237 and 92.658. The statistics that we need from this table are: ● ● ● a.150 58.750 71.250 63.313 71.125 Sales of Nestlé coffee with three variables.84 2. Lower confidence limit is 82.938.28.11 showing now the variation in the unit price of a jar of coffee.661.837 units For the confidence intervals we use equation 10(xi).51 73.92 2.87 2. Determine the coefficient of determination. allotted shelf space. ˆ y tse 10(xi) Using [function TINV] in Excel.237 units Upper confidence limit is. the intercept on the y-plane 75.10.00 m2.01 * 8 9.997.75 2. Visits/month.00 m2. b2.6166 * 5.00 1. the value of t for a confidence level of 85% is 1. the slope corresponding to the shelf space.437 units.82.837 1. We use again from Excel [function LINEST] and here we select a virgin area of four rows and five columns and we enter three columns for the three independent variables x. What are the 85% confidence levels for this estimate? Here x1 is the estimate of sales of eight visits per month.50 2.51 92. Assume now that for the sales data in Table 10.82 1.975 74. x2 4.25 2. The output from using this function is in Table 10.827.06 2.35 1.938.11 Sales.92 2.9 the distributor looks at the unit price of the coffee sold during the period that the analysis was made.64 * 3.750 39. x3 1.837 1.75 2.67 4.487 76.00 82. visits per month. This expanded information is in Table 10. and the unit price of coffee. x2 3.25 2.437 units Thus we can say that using this regression model our best estimate of monthly sales is 82. From this information develop a threeindependent-variable multiple regression model for the unit sales per month as a function of visits per month.813 33. The monthly sales are determined from the regression equation: ˆ y 14. the slope corresponding to x1 the visits per month 2.837 units and that we are 85% confident that the sales will between 73.20 2.

is greater than 0. Examples of these are: the sales of mobile phones from about 1995 to 2000.38 #N/A #N/A #N/A a 75.00 m2.852.60 57. x. 67.591.12.747.9336. 0.491. Estimate the monthly unit sales if eight visits per month were made to the supermarket.575.50 12.28 180.945.28 1.001.491. the allotted shelf space was 3.12 variables. the value of t for a confidence level of 85% is 1.50. The equation or model that describes this relation is from equation 10(xii) for three independent variables: ● of freedom are 6 from Table 10.984.28 * 8 4.60 76.984.661. 4.984.977 and 76.31 #N/A #N/A #N/A se 5.12 are.9336.05 2.8 the strength of the relationship is quite good.661. the slope corresponding to the price.972.82 * 3.Chapter 10: Forecasting and estimating from correlated data 351 Table 10. ˆ y 75.00 m2. x3 18. r2 0. Estimated monthly sales are determined from the regression equation.101 units.6502.50x3 As the coefficient of determination.101 units 1.82x2 18.05 2. 67.591.32 b3. the standard error of the estimate 5.50. y.60 df 6 2.00 18. Forecasting Using Non-linear Regression Up to this point we have considered that the dependent variable is a linear function of one or several independent variables.658. ● Degrees of freedom 6.039 units and that we are 85% confident that the sales will between 57. ● Coefficient of determination. the best estimate of monthly sales is 67. The degrees . ● se. For some situations the relationship of the dependent variable.491.039 1.661.6502 * 5. and the unit price of coffee was €2. Lower limit is.591.977 units ˆ y ˆ y a b1x1 b2x2 b3x3 75.491.60.50.82 5.556.38 r2 0.989.28x1 4.50 67.60 from Table 10. What are the 85% confidence levels for this estimate? Here x1 is the estimate of sales of eight visits per month. may be non-linear but a curvilinear function of one independent variable. Regression statistics for coffee sales – three b3 18.14 b2 4.039 units Thus we can say that using this regression model.039 Upper limit is. The confidence limits of sales using the standard error of the estimate of 5.50 * 2. x2 is the shelf space of 3.9336 28.658.591.658.491. and x3 is the unit sales price of coffee of €2.546.05 41.6502 * 5. the increase of HIV contamination in For the confidence intervals we use equation 10(xi) and [function TINV] in Excel.26 b1 2.

8.7 is given in Figure 10. 6. and the increase in the sale of DVD players.6456x 849.594.8 Second-degree polynomial for house prices.000. Curvilinear relationships can take on a variety of forms as discussed below.000. c.0575x2 9.000.000 5. A second-degree or quadratic polynomial function. b.828. The regression equation and the corresponding coefficient of determination are as follows: ˆ y r2 41. ● ● Options Display equation on chart and Display R-squared value on chart.6456x R 2 0.000.9 we have the regression function where x has a power of 6. To do this we first select the data points on the graph and then from the [Menu chart] proceed sequentially as follows: ● ● In Microsoft Excel we have the option of a polynomial function with the powers of x ranging from 2 to 6.352 Statistics for Business Figure 10. Polynomial function A polynomial function. where x has a power of 2 for the surface area and house price data of Table 10.000.000 1.9653 Add trend line Type polynomial power In Figure 10.1408 4. k are constants: y a bx cx2 dx3 … kxn 10(xiv) Since we only have two variables x and y we can plot a scatter diagram.594.1408 0. The regression equation .000 y 41. d.0575x 2 9. Once we have the scatter diagram for this bivariate data we can use Microsoft Excel to develop the regression line. takes the following general form where x is the independent variable and a.000.000 2.000 Price (€) 3. ….000 0 0 50 100 150 200 250 Area (m2) 300 350 400 450 500 Africa.9653 849.828.

712.1409x2 113.000.9729 4.1409x2 113. For example.0086x r2 0.000 Price (€) 3.0430x4 12.0001x5 0.5790 R2 0.Chapter 10: Forecasting and estimating from correlated data 353 Figure 10.000.9298 r 2 We can see that as the power of x increases the closer is the coefficient of determination to unity or the better fit is the model.000 0 0 50 100 150 200 250 Area (m2) 300 350 400 450 500 and the corresponding coefficient of determination are as follows: ˆ y 0.879.0261x3 1.9729 The exponential relationship for the house prices is shown in Figure 10. the coefficient of determination was 0. Note for this same data when we used linear regression.0000x6 0.5790 0. 6.9 Polynomial function for house prices where x has a power of 6.000 y 5.712.719.000 2. respectively.0001x5 0. in the Northern hemisphere the sale of swimwear is higher in the spring and summer than in the autumn and winter.000. Seasonal Patterns in Forecasting In business.7.0430x4 12.8930x 2.000 1. and a and b are constants: y aebx 10(xv) . and the sale of cold beverages is higher in the summer than in the winter.000.415.000.0261x3 1.879.000.8623. Figure 10.039.039.8930x 2. seasonal patterns often exist.10 and the following is the equation with the corresponding coefficient of determination: ˆ y 110.9913e0. discussed Exponential function An exponential function has the following general form where x and y are the independent and dependent variables. The linear regression analysis for a time series analysis. particularly when selling is involved. The demand for heating oil is higher in the autumn and winter.719.000 0.

0 * summer(n) 1. Use the information in Table 10. 6.10 Exponential function for surface area and house prices. Plot the actual data and see if a seasonal pattern exists The actual data is shown in Figure 10.5 * winter(n) 1.000. Here we determine the average value around a particular season for a 12-month period. For example. Step 1. 1. can be modified to take into consideration seasonal effects.000.0086x R 2 0.000.9298 5.000 2. or four quarters. Note that for the x-axis we have used a coded value for each season starting with winter 2000 with a code value of 1.13 gives the past data for the number of pallets of soft drinks that have been shipped from a distribution centre in Spain to various retail outlets on the Mediterranean coast.415.000.000 1. The following application illustrates one approach.000. the following relationship indicates how we calculate the centred moving average around the summer quarter (usually 15 August) for the current year n: 0.000 0 0 50 100 150 200 250 Area (m2) 300 350 400 450 500 early in the chapter.9913e0.354 Statistics for Business Figure 10.000 110.0 * autumn(n) 0.000 Price (€) 3.11 and from this it is clear that the data is seasonal. Determine a centred moving average A centred moving average is the average value around a designated centre point. Step 2.0 * spring(n) 1.000 y 4.000. .5 * winter(n 1) 4 For example if we considered the centre period as summer 2000 then the centred Application of forecasting when there is a seasonal pattern: Soft drinks Table 10.13 to develop a forecast for 2006.

000 20.683 18.823 16.031 2001 2002 Winter Spring Summer Autumn Winter Spring Summer Autumn Winter Spring Summer Autumn 2004 2005 Figure 10.595 16.156 19.295 19.000 21.443 15.000 15.Chapter 10: Forecasting and estimating from correlated data 355 Table 10.000 Sales (pallets) 19.909 20. Quarter Actual sales (pallets) 14.342 20.844 15.152 18.688 17.000 16.769 18.000 14.707 17.577 20.13 Year 2000 Sales of soft drinks.480 17.028 17.000 18.226 19.064 19.503 19.000 17.730 16.11 Seasonal pattern for the sales of soft drinks. 23.000 22.081 Year 2003 Quarter Winter Spring Summer Autumn Winter Spring Summer Autumn Winter Spring Summer Autumn Actual sales (pallets) 18.000 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 Quarter (1 winter 2000) .948 16.665 15.

The values for the centred moving average for the complete period are in Column 5 of Table 10.02 1.0294 1.462. 844 1.41 16.31 20.68 16.22 17.0 * 16.845. 443 0.315.55 20.480 17.892.06 0. 665 1.246.125.15 16.68 19.21 18.37 20.135.693.592.44 18.69 15.97 9 Regression ^ forecast.414.06 0.730 16.271.33 18.9809 1.17 17.02 1.438.5 * 14.707 17.909 20.00 18.02 1.0 * 15.593.747.670.97 15.13 17.226 19.30 15.646.006. 665 1.052.9631 0.361.91 17.74 Quarter Winter Spring Summer Autumn Winter Spring Summer Autumn Winter Spring Summer Autumn Winter Spring Summer Autumn Winter Spring Summer Autumn Winter Spring Summer Autumn 2001 2002 2003 2004 2005 15.503 19.081 18.824.743.05 20.60 16.9856 1.82 19.131.06 0.665 15.97 1.0588 0.0 * 15.900.9424 0.42 18.88 We are determining a centred moving average and so the next centre period is autumn 2000.518.100.130.595 16. 823 0.50 18.13 20.028 17.279.730 1.914.577 20.356 Statistics for Business moving average around this quarter using the actual data from Table 10.0103 .9698 1.98 19.95 0.97 1.9305 0.02 1.064 20.36 20.5 * 16.60 19.856 19.06 0.208.792.84 16.515.706.769 18.844 15.965 18.95 0.688 17. 823 4 15.14.0147 1.240.20 20.0279 1.0041 1.50 1.63 17.035.95 0.948 16.50 19.776.38 17.06 0.977.38 19.719.899.00 16.97 1.996.669.72 17.55 19.958.096.29 18.95 8 Sales/SI 15.427.731.63 19.75 19.07 19.792.10 18.06 0.0654 0.382.0 * 15.50 * 15.03 19.342 21. 3 Code 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 4 Actual sales (pallets) 14.5 * 15.13 is as follows: 0.0 * 15.00 19.309.295 20.031 5 Centred moving average 6 SIp 7 Seasonal index SI 0.06 17.812.00 16.920. 688 4 16.0565 0.823.494.619.97 1.683 18.95 0. For this quarter.37 17.505.930.97 1.50 16.88 19.93 16.89 20.823 16.0 * 16.362.0646 0.75 17.285.02 1.713.50 17.746. 443 1.38 18.60 18.730 1.439.13 16.00 17.516. we drop the data for winter 2000 and add spring 2001 and thus the centred moving average around autumn 2000 is as follows: 0.286.53 16.443 15.9793 1.15 15.00 Thus each time we move forward one quarter we drop the oldest piece of data and add the next quarter.0552 0. y 15. 035.061.9732 0.75 18.13 18.12 18.14 1 Year 2000 2 Sales of soft drinks – seasonal indexes and regression.054.723.055. Note that we Table 10.616.97 1.67 19.404.95 0.02 1.9339 0.51 19.967.99 16.492.9542 1.88 16.

and 10% below the year for autumn 2004 (1 0.12 Centred moving average for the sale of soft drinks. SI p (Actual recorded sales in a period) (Moving average for the same period) 357 We interpret this by saying that sales in the winter 2004 are 2% below the year (1 0.12. SIp This is the ratio.000 Centred moving average of pallets 19. It gives a specific seasonal index for each month. Determine an average seasonal index.98 This data is in Column 6 of Table 10.000 17.Chapter 10: Forecasting and estimating from correlated data cannot determine a centred moving average for winter and spring 2000 or for summer and autumn of 2005 since we do not have all the necessary information.000 20.90 Figure 10. Divide the actual sales by the moving average to give a period seasonal index. Table 10. are as in Table 10. Step 3. in the spring they are 3% above the year. for the four quarters This is determined by taking the average of all the ratios. Winter 0. For example. Sales of soft drinks – seasonal Spring 1.98). SIp for like seasons.90).15.000 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 Coded period (1 winter 2000) .000 15. 6% above the year for the summer.15 indexes. if we consider 2004 the ratios.000 16. What we have done here is compared actual sales to the average for a 12-month period. rounded to two decimal places.03 Summer 1. SI. Step 4. For example.06 Autumn 0. The line graph for this centred moving average is in Figure 10. 21.14.000 18.

000 20. Step 5.14. but rounded to two decimal places.9740 1. 21.13.000 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 Coded period (1 winter 2000) .0601 0.13 Sales/SI for soft drinks.000 17.0588 1.000 19. The regression equation and the 1. are shown in Column 7 of Table 10. SI This data is shown in Column 8.16 indexes. Another way to say is that the sales are deseasonalized.14. What we have done here is removed the seasonal effect of the sales. the values are the same.0565 1. These same indices.0601 The seasonal indices for the four seasons are in Table 10.0000 Figure 10. Note that the average value of Table 10.000 Sales/SI 18. Season Summer Autumn Winter Spring Average Sales of soft drinks – seasonal SI 1.0173 1.9486 0. The line graph for these deseasonalized sales is in Figure 10.358 Statistics for Business the seasonal index for the summer is calculated as follows: 1. Develop the regression line for the deseasonalized sales The regression line is shown in Figure 10. and just showed the trend in sales without any contribution from the seasonal period. Divide the actual sales by the seasonal index.0552 1. Note.0646 5 these indices must be very close to unity since they represent the movement for one year.000 16.16.000 15.0654 1. Step 6. for similar seasons.

8451 9.9673 650. slope of the line r2.9673 15. b. Alternately.073.17.931 a. use [function 359 Alternatively we can use in Excel [function LINEST] by entering from Table 10.207.17 Sales of soft drinks – seasonal indexes. Figure 10.000 15.861 15.4554 129. continue the rows down for 2006 using the code values of 25 to 28 for the four seasons.8451x 0.14 Deseasonalized sales and regression line for soft drinks.000 16.11 the x-values of Column 1 and the y-values from Column 8 to give the statistics in Table 10. standard error of estimate degrees of freedom (n 2) .9673 Table 10.14.3687 307.Chapter 10: Forecasting and estimating from correlated data corresponding coefficient of determination are as follows: ˆ y r2 230.0539 0.207. From the regression line forecast deseasonalized sales for the next four quarters This can be done in two ways. intercept on the y-axis se.4554 Using the corresponding values of a and b we have developed the regression line values as shown in Column 9 of Table 10. Either from the Excel table.0335 22 2.000 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 Coded period (1 winter 2000) 230. Step 7.000 22. 23.000 18.000 20.8451x 15.000 y 21.0810 61. coefficient of determination 230.282.000 17.000 Sales/SI 19.4554 R 2 0.207.

the longer the time period the more uncertain is the model because of the changing environment – What new technologies will come onto the market? What demographic changes will occur? How will interest rates move? An approach to recognize this is to develop forecast models for different time periods – say short. assume that you want to forecast sales of a certain product of which there are six different models.43 21. and long-term. Collected data Quantitative forecast models use collected or historical data to estimate future outcomes. For example. However. In collecting data it is better to have detailed rather than aggregate information.360 Statistics for Business Table 10. Thus when we use statistical analysis to forecast future patterns we have to exercise . price increases. or exchange Considerations in Statistical Forecasting We must remember that a forecast is just that – a forecast.06 0. medium. y 20. it can be very quickly executed using an Excel spread sheet and the given functions. caution when we interpret the results. Step 8.95 6 Regression ^ forecast.432 21.440.557 5 Seasonal index SI 0. Multiply the forecast regression sales by the SI to forecast 2006 seasonal sales The forecast seasonal sales are shown in Column 4 of Table 10.729 20. Time horizons Often in business. The following are some considerations.18.97 1.18. However.576 22.15.18 1 Year 2 Sales of soft drinks – forecast data.978.27 21.671. What we have done is reversed our procedure by now multiplying the regression forecast by the SI.58 21. managers would like a forecast to extend as far into the future as possible. The actual and forecast sales are shown in Figure 10. These values are in Column 6 of Table 10.02 1. 3 Code 4 Forecast sales (pallets) 20.209. revenues can be distorted by market changes. You could develop a model of revenues for all of the six models. Although at first the calculation procedure may seem laborious. The forecast model for the shorter time period would provide the most reliable information.12 Quarter 2006 Winter Spring Summer Autumn 25 26 27 28 FORECAST] where the x-values are the code values 25 to 28 and the actual values of x are the code values 1 to 24 and the actual values of y are the deseasonalized sales for these same coded periods. When we developed the data we divided by the SI to obtain a deseasonalized sale and used the regression analysis on this information. as the latter might camouflage situations.

02 1. Table 10.19.33 1. α/μ Collected data. It would be better first to develop a time series model on a unit basis according to product range. For example. Product A 1.257 1.024 1.000 Sales (pallets) 20.254 164.58 323.Chapter 10: Forecasting and estimating from correlated data 361 Figure 10.502 1.320 1.259. consider the time series data in Table 10.080 1.13 Product B 800 40 564 12 16 456 56 12 954 377. This base model would be useful for tracking inventory movements.000 14.000 15. is an indicator of how reliable is a forecast model. the coefficient of variation of the data.000 18.000 Forecast 23.000 21. It can then be extended to revenues simply by multiplying the data by unit price.15 Actual and forecast sales for soft drinks. .425 1. or the ratio of the standard deviation to the mean (α/μ).370 1.100 1.19 Period January February March April May June July August September s (as a sample) μ Coefficient of variation. 24.17 Coefficient of variation When past data is collected to make a forecast.000 16.000 22.000 19.000 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 Quarter (1 winter 2000) rates if exporting or importing is involved.11 0.000 17.

whether it is estimated at 10%. In this case a forecast model should be quite reliable. Curvilinear or exponential models We must exercise caution in using curvilinear ˆ functions. However plastic and composite materials are rapidly replacing steel. In the worked example. or be prepared for the threats. For Using this for a surface area of 1.1408 Models are dynamic A forecast model must be a dynamic working tool with the flexibility to be updated or modified as soon as new data become available that might impact the outcome of the forecast. σ ˆ. accuracy must be judged in light of control a firm has over resources and external events. or demographic reasons. surface area and house prices. Similarly. in the past. These types of events may not affect short-term planning but certainly are important in long-range forecasting when capital appropriation for plant and equipment is a consideration. as an estimate of the population standard deviation. In the classic life cycle curve in marketing. 20%. Model accuracy All managers want an accurate model. more and more uses are being found for plastics. The accuracy of the model. we developed the following twodegree polynomial equation: ˆ y 41. s. In using the coefficient of variation as a guide. models for the European Economy have been modified to take into account the impact of the Euro single currency. steel requirements might be correlated with the forecast sale of automobiles.6456x 849. the growth period for successful new products often follows a curvilinear or more precisely an exponential growth model but this profile is unlikely to be sustained as the product moves into the mature stage.0575x2 9. Here a forecast model would not be as reliable.000 m2 forecasts a house price of €32. Note that in determining the coefficient of variation we have used the sample standard deviation.362 Statistics for Business For product A the coefficient of variation is low meaning that the dispersion of the data relative to its mean is small. In situations like this perhaps there is a seasonal activity of the product and this should be taken into account in the selected forecast model. an exponential growth often cannot be sustained in the future often because of economic. For example. so this element would need to be incorporated into a forecast for the demand for plastics. Alternatively. example. where the predicted value y changes rapidly with x. Even though the actual collected data may exhibit a curvilinear relationship.594. plotting the data on a scatter diagram would be a visual indicator of how good is the past data for forecasting purposes. for Product B the coefficient of variation is greater than one or the sample standard deviation is greater than the mean. Besides accuracy. care should be taken as if there is a trend in the data that will of course impact the coefficient. which is . market. also of interest in a forecast is when turning points in situations might be expected such as a marked increase (or decrease) in sales so that the firm can take advantage of the opportunities. or say 50% can only be within a range bounded by the error in the collected data. As already discussed in the chapter. so this factor would distort the forecast demand for steel if the old forecasting approach were used. an economic model for the German economy had to be modified with the fall of the Berlin Wall in 1989 and the fusion of the two Germanys.828. Further.3 million. Market changes Market changes should be anticipated in forecasting. On the other hand.

248 units for 2010.000 3.Chapter 10: Forecasting and estimating from correlated data 363 Figure 10. Here we developed a linear regression model that gave a coefficient of determination of 0.400 4. The model basis was a combined multiple regression and curvilinear relationships incorporating variables in the United States economy such as changes in the GNP.0000e0.400 2. Now if we develop an exponential relationship for this same data then this would appear as in Figure 10. The equation describing this curve is. refinery. Consider also the sale of snowboards worked example presented at the beginning of the chapter. If a quantitative forecast model is used there needs to be consideration of subjective input. The activity may be a trial and error process selecting a model and testing it against actual data or opinions.600 1.600 2. and chemical plant operation.200 4.000 1. In the 1980s.16 Exponential function for snowboard sales. 4. in a marketing function in the United States.2479x Selecting the best model It is difficult to give hard and fast rules to select the best forecasting model.400 3.400 1.2479x R 2 0. chemical The data gives a respectable coefficient of determination of 0. and vice-versa. This model was needed to estimate financial returns from future oil exploration.600 3. If we use this to make a forecast for sale of snowboards in 2010 we have a value of 2. 20 06 . drilling. ˆ y e0.800 2.9191 Snowboards sold (units) 95 96 97 98 99 00 01 02 03 04 05 20 19 19 19 19 20 20 20 20 20 Year beyond the affordable range for most people.200 1.9316 and the model forecast sales of 3. Models can be complex.000 800 600 400 200 0 89 90 91 92 93 94 19 19 19 19 19 19 19 y 0.800 3.800 1. energy consumption.62 10216 which is totally unreasonable. interest rates.200 3. I worked on developing a forecast model for world crude oil prices.000 2.200 2.16.9191.

A series of forecast models have been developed by a group of political scientists who study the United States elections. demographic changes. . Jim Melcher. y. To forecast using the regression equation. Correlation is the strength between these variables and can be illustrated by a scatter diagram. and country political risk. r. The coefficient of correlation can be positive or negative whereas the coefficient of determination is always positive. Both of these coefficients have a value between 0 and 1. The models have been used in all the United States elections since 1948 and have proved highly accurate. x. The closer either is to unity then the stronger is the correlation. These models use combined factors such as public opinions in the preceding summer. the model was tested against known situations. Mathematically. we insert the time.364 Statistics for Business production and forecast chemical use. into the regresˆ. 1 September 2000. A time series and correlation A time series is bivariate information of a dependent variable. and seasonal patterns in data. If the correlation is reasonable. a and b are constants. and x is the time. Throughout the development. International Herald Tribune. linear and multiple regression. sion equation to give a forecast value y The variability around the regression line is measured by the standard error of the estimate. a money manager based in New York.2 In 2007 the world economy suffered a severe decline as a result of bank loans to low income homeowners. y a bx. The regression equation gives the best straight line that minimizes the error between the data points on the regression line and the corresponding actual data from which the regression line is developed. capital expenditure. seasonal effects.3 Chapter Summary 2 3 This chapter covers forecasting using bivariate data and presents correlation. Linear regression in a time series data ˆ ˆ The linear regression line for a time series has the form. such as sales with an independent variable x representing time. 20 August 2007. knowing a and b. are two numerical measures to record the strength of the linear relationship. the strength of the economy. Warnings were missed in US loan meltdown. International Herald Tribune. taxation. then regression analysis is the technique to develop an equation that describes the relationship between the two variables. se. using complex derivative models forecast this downturn and pulled out of this risky market and saved his clients millions of dollars. where y is the predicted value of the dependent variable. The coefficient of correlation. We can use the standard error of the estimate to give ˆ ˆ the confidence in our forecast by using the relationship y zse for large sample sizes and y tse for sample sizes no more than 30. Gore is a winner. and the public’s assessment of its economic wellbeing. r2. The model proved to be a reasonable forecast of future prices. and the coefficient of determination.

Seasonal patterns in forecasting Often in selling seasonal patterns exist. Forecasting using multiple regression Multiple regression is when there is more than one independent variable x to give an equation of ˆ the form.Chapter 10: Forecasting and estimating from correlated data 365 Linear regression and casual forecasting We can also use the linear regression relationship for causal forecasting. care must be taken in using curvilinear models as though the coefficient of determination indicates a high degree of accuracy. In causal forecasting all of the statistical relationships of correlation. Here the assumption is that the predicted value of the dependent variable is a function not of time but another variable that causes the change in y. A coefficient of multiple determination. In this case we develop a forecast model by first removing the seasonal impact to calculate a seasonal index. Further. variability. r2. measures the strength of the relationship between the dependent variable y and the various independent variables x. Again with both these relationships we have a coefficient of determination that illustrates the strength between the dependent variable and the independent variable. When we multiply the regression forecast by the seasonal index we obtain a forecast by season. If we divide the actual sales by the seasonal index we can then apply regression analysis on this smoothed data to obtain a regression forecast. se. the model may not follow market changes. Alternatively it may have an exponential relationship of the form y aebx. The only difference is that the value of the independent variable x is not time. Forecasting using non-linear regression Non-linear regression is when the variable y is a curvilinear function of the independent variable x. Other considerations are that we should work with specific defined variables rather than aggregated data and that past data must be representative of the future environment for the model to be accurate. and again there is a standard error of the estimate. prediction. Considerations in statistical forecasting When we forecast using statistical data the longer the time horizon then the more inaccurate is the model. and confidence level of the forecast apply exactly as for a time series data. y a b1x1 b2x2 b3x3 … bkxk. The function may be a polynomial relationship of the form y a bx cx2 dx3 … kxn. .

37 0. What quantitative data indicates that there is a reasonable relationship over time with the safety incidents reported by ExxonMobil? 5.000 hours 1.06 0.000 hours worked since 1995.82 0.65 0. ExxonMobil implemented worldwide its Operations Integrity Management System (OIMS). Using the regression information. 2007. Plot the data on a scatter diagram.4 Year 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 Incidents per 200. what might you conclude about the future safety record of ExxonMobil? 4 Managing risk in a challenging business.51 0. 2.366 Statistics for Business EXERCISE PROBLEMS 1. The Lamp.35 1. Using the regression equation what is a forecast of the number of reported incidents in 2010? What are your comments about this result? 7. (2). a programme that Exxon itself had developed in 1992 in part as a result of the Valdez oil spill in Alaska in 1989. Using the regression equation what is a forecast of the number of reported incidents in 2007? 6. p.38 0.72 0. Since the implementation of OIMS the company has experienced fewer safety incidents and its operations have become more reliable. Safety record Situation After the 1999 merger of Exxon with Mobil. These results are illustrated in the table below that shows the total incidents reported for every 200. From the data. Develop the linear regression equation that best describes this data. the newly formed corporation. what is annual change in the number of safety incidents reported by ExxonMobil? 4.38 0. .84 0. 26. ExxonMobil. 3.25 Required 1.98 0.

Office supplies Situation Bertrand Co. rubber bands. What might be an explanation for the relative increase in sales for the months of August and September? 4. pencils. Month January 2003 February 2003 March 2003 April 2003 May 2003 June 2003 July 2003 August 2003 September 2003 October 2003 November 2003 December 2003 January 2004 February 2004 March 2004 April 2004 May 2004 June 2004 July 2004 August 2004 September 2004 October 2004 November 2004 December 2004 £ ‘000s 14 18 16 21 15 19 22 31 33 28 27 29 26 28 31 33 34 35 38 41 43 37 37 41 Month January 2005 February 2005 March 2005 April 2005 May 2005 June 2005 July 2005 August 2005 September 2005 October 2005 November 2005 December 2005 January 2006 February 2006 March 2006 April 2006 May 2006 June 2006 July 2006 August 2006 September 2006 October 2006 November 2006 December 2006 £ ‘000s 42 43 42 41 41 42 43 49 52 47 48 49 51 50 52 54 57 54 48 59 61 57 56 61 Required 1. develop a time series scatter diagram for this information. diaries. pens.Chapter 10: Forecasting and estimating from correlated data 367 2. For a particular geographic region the company records over a 4-year period indicated the following monthly sales in pound sterling as follows. computer paper. What is an appropriate linear regression equation to describe the trend of this data? 3. is a distributor of office supplies including agendas. 2. and the like. paper clips. Using a coded value for the data with January 2003 equal to 1. What can you say about the reliability of the regression model that you have created? Justify your reasoning. .

989 8.720 7.400 11.300 10.543 12.368 Statistics for Business 5. Is the linear equation a good forecasting tool for forecasting the future value of the road deaths? What quantitative piece of data justifies your response? 4. p.000 8.333 7.600 9. Using the regression information. What are the average quarterly sales as predicted by the regression equation? 6.833 11. 2. .967 Year 1992 1993 1994 1995 1996 1997 1998 1999 2000 2001 2002 Deaths 9. Plot the data on a scatter diagram.500 8.333 10.548 10. Road deaths Situation The table below gives the number of people killed on French roads since 1980. what is the forecast of road deaths in France in 2030? 7.580 7.400 12.5 Year 1980 1981 1982 1983 1984 1985 1986 1987 1988 1989 1990 1991 Deaths 12. what is the yearly change of the number of road deaths in France? 5.967 7.067 7.855 10.833 9.242 Required 1. Develop the linear regression equation that best describes this data. What are your comments about the model you have created and its use as a forecasting tool? 3. and December 2009? Which would be the most reliable? 7. What would be the forecast of sales for June 2007. 2. Using the regression information. 3.500 10.333 8.083 8. what is the forecast of road deaths in France in 2010? 6. Using the regression information. What are your comments about the forecast data obtained in Questions 5 and 6? 5 Metro-France 16 May 2003. December 2008.

Since the restaurant is open for lunch and dinner 7 days a week there are times that the restaurant does not have the full complement of staff. Carbon dioxide is one of the gases widely believed to cause global warming. What are your comments about using this information for forecasting? 5. Carbon dioxide Situation The data below gives the carbon dioxide emissions. Restaurant serving Situation A restaurant has 55 full-time operating staff that includes kitchen staff and servers. in millions of metric tons carbon equivalent.800 1. p. From the answer in Question 3. 2. there are times when 6 Insurers weigh moves on global warming.650 1.Chapter 10: Forecasting and estimating from correlated data 369 4. forecast the carbon dioxide emissions in North America for 2020? 7.600 1.800 Required 1.825 1.625 1.850 1. . 1.6 Year 1992 1993 1994 1995 1996 1997 1998 1999 2000 2001 North America 1. 7 May 2003. What are the indicators that demonstrate the strength of the relationship between carbon dioxide emission and time? What are your comments about these values? 3. what is your 95% confidence limit for this forecast? 6. What is the annual rate of increase of carbon dioxide emissions using the regression relationship? 4. forecast the carbon dioxide emissions in North America for 2010? 5. for North America. Using the regression equation. Using the regression equation. Plot the information on a time series scatter diagram and develop the linear regression equation for the scatter diagram.660 1.750 1. CO2. Wall Street Journal Europe.790 1. In addition.

Product sales Situation A hypermarket made a test to see if there was a correlation between the shelf space of a special brand of raison bread and the daily sales. This information is given in the table below. 7. If there are six employees absent. The restaurant manger conducted an audit to determine if there was a relationship between the number of staff absent and the average time that a client had to wait for the main meal. estimate the average waiting time as predicted by the linear regression equation. The following is the data that was collected over a 1-month period. to the nearest two decimal places. develop a scatter diagram between number of staff absent and the average time that a client has to wait for the main meal. Number of staff absent 7 1 3 8 0 4 2 3 5 9 Average waiting time (minutes) 24 5 12 30 3 16 15 20 22 27 Required 1. . what is a quantitative measure that illustrates a reasonable relationship between the waiting time and the number of staff absent? 3. what is the average waiting time to obtain the main meal as predicted by the linear regression equation? 6. What are your comments about this result? 8. estimate the average waiting time as predicted by the linear regression equation. When the restaurant has the full compliment of staff. What is an estimate of the time delay per employee absent? 5. Using regression analysis. For the information given. What is the linear regression equation that describes the relationship? 4.370 Statistics for Business staff are simply absent as they are sick. What are some of the random occurrences that might explain variances in the waiting time? 6. 2. If there are 20 employees absent.

What are your conclusions? 3.7 9.7 10. The Transport Authority was interested to see if they could develop a relationship between the number of users and another easily measurable variable. what is the estimated daily sale of this bread? 4.Chapter 10: Forecasting and estimating from correlated data 371 Shelf space (m2) 0.00 1.25 0.25 1. and August. The following is the data collected: Year 1993 1994 1995 1996 Unemployment rate (%) 11.5 12.25 2.00 2. What does this sort of experiment indicate from a business perspective? 7. If the allocated shelf space was 1. July. In this way they would have a forecasting tool.00 m2. The variables they selected for developing their models were the unemployment rate in this region and the number of foreign tourists visiting Germany. German train usage Situation The German rail authority made an analysis of the number of train users on the network in the southern part of the country since 1993 covering the months for June. If the allocated shelf space was 5. what is the estimated daily sale of this bread? What are your comments about this forecast? 5.4 No.00 2.75 0.50 m2. Develop a linear regression equation for the daily sales and the allocated shelf space.75 1. 2.25 Daily sales units 12 18 21 23 18 23 25 28 30 34 32 40 Required 1. of tourists (millions) 7 2 6 4 Train users (millions) 15 8 13 11 (Continued) .25 2. Illustrate the relationship between the sale of the bread and the allocated shelf space.00 1.50 0.

5 8.000 580. Illustrate the relationship between the number of train users and foreign tourists on a scatter diagram. The table below gives data on a monthly basis for revenues.200 54.000 Sales persons 542 521 328 622 122 No.372 Statistics for Business Year 1997 1998 1999 2000 2001 2002 2003 2004 Unemployment rate (%) 11.7 7. Sales revenues 721.2 6. In any given year. what would be a forecast for the number of train users? 6. if the number of foreign tourists were estimated to be 10 million.2 7. Illustrate the relationship between the number of train users and unemployment rate on a scatter diagram. what are your observations? 8. This data is to be analysed using multiple regression analysis.985 13.7 9. If a polynomial correlation (to the power of 2) between train users and foreign tourists was used. in pound sterling.7 No. sells cosmetic products by simply advertising in throwaway newspapers and by ladies who organize Yam parties in order to sell directly the products.712 25. what are your conclusions about the correlation between the number of train users and the unemployment rate? 3.7 12. 2. of tourists (millions) 14 15 16 12 14 20 15 7 Train users (millions) 25 27 28 20 27 44 34 17 Required 1.000 315. for sales of cosmetics each month for the last year according to advertising budget and the equivalent number of people selling full time.400 Advertising budget 47.5 9. Using simple regression analysis. what are your conclusions about the correlation between the number of train users and the number of foreign tourists? 5.200 770.512 94.000 910. Using simple regression analysis. of yam parties 101 67 82 75 57 . Cosmetics Situation Yam Ltd. 4.

of yam parties 68 84 85 81 37 99 100 Required 1.500 957.000 594.245 36.500 515. and the number of sales persons. 5. 2.352 25.000 sales contacts.000 97.000 Advertising budget 46.000 Sales persons 412 235 435 728 656 856 656 No.Chapter 10: Forecasting and estimating from correlated data 373 Sales revenues 587. From the answer developed in Question 1. Develop a three-independent-variable multiple regression model for the sales revenues as a function of the advertising budget. the number of sales persons. Develop a two-independent-variable multiple regression model for the sales revenues as a function of the advertising budget. Does the relationship appear strong? Quantify. what would be an estimate of the sales revenues for that month? 3. Then what would be an estimate of the sales for that month? 6.000 and there will be 250 sales persons available. What are the 95% confidence intervals for Question 2? 4.450 865. In this case. Hotel revenues Situation A hotel franchise in the United States has collected the revenue data in the following table for the several hotels in its franchise. assume for a particular month it is proposed to allocate a budget of $US 4. From the answer developed in Question 4. What are the 95% confidence intervals for Question 5? 9. Year 1996 1997 1998 1999 2000 2001 2002 Revenues ($millions) 35 37 44 51 50 58 59 (Continued) . Does the relationship appear strong? Quantify.000 to use 30 sales persons. assume for a particular month it is proposed to allocate a budget of £30. with a target to make 21.000 1.000 965.000 77.027. and the number of Yam parties.897 67.847 64.

2. and the 1st quarter 2007. and the number of shares held by Dan. In addition. From the given information develop a two-degree polynomial regression model of the time period against revenues. Table 1 Table Hershey.500 25. he made optional cash investment for new shares. of shares 100. forecast the revenues in 2008 and give the 90% confidence limits. forecast the revenues in 2008. of shares 101. at the end of each quarter since the time of the initial purchase. 5. Dan participated in Hershey’s reinvestment programme.4734 102. Dan bought a round lot (100 shares) in September 1988 for $28.292 26. 6. From the relationship in Question 6. The share price.374 Statistics for Business Year 2003 2004 2005 Revenues ($millions) 82 91 104 Required 1.500 35. What are your comments related to making a forecast for 2008 and 2020? 10. From the relationship in Question 1. From the relationship in Question 6. 10. From the given information develop a linear regression model of the time period against revenues.6919 101. 9. forecast the revenues in 2020.500 per share. That meant he reinvested all quarterly dividends into the purchase of new shares.089 No. is given in Table 1. Pennsylvania.0000 100.9584 End of month September 1988 December March 1989 . From the relationship in Question 1. United States of America. What is the coefficient of determination for relationship developed in Question 6? 8.9373 102. What is the coefficient of determination for relationship developed in Question 1? 3. forecast the revenues in 2020 and give the 90% confidence limits. Since that date. What is the annual revenue growth rate based on the given information? 4.010 No. Hershey Corporation Situation Dan Smith has in his investment portfolio shares of Hershey Company. 7.126 31. a Food Company well known for its chocolate.3673 End of month June September December 1989 Price ($/share) 31. from time to time. Price ($/share) 28.

000 49. How might you explain the shape of the line graph? 2.5055 194.980 53.4406 959.1432 417.995 38. Price ($/share) 31.500 45.375 59.7137 956.896 42.6702 179.4409 409.079 41.6250 End of month December 1998 March 1999 June September December 1999 March 2000 June September December 2000 March 2001 June September December 2001 March 2002 June September December 2002 March 2003 June September December 2003 March 2004 June September December 2004 March 2005 June September December 2005 March 2006 June September December 2006 March 2007 Price ($/share) 63.625 71.6230 407.1457 176.170 73.539 48.1921 971.996 53.350 56.Chapter 10: Forecasting and estimating from correlated data 375 Table 1 (Continued).4516 201.062 63. For the data given and using a coded value for the quarter starting at unity for September 1988.877 55.1860 948.106 44.6034 462.9986 444.928 No.0976 174.500 35.6803 149.722 37.5491 437.2886 975.999 41.867 51.081 55.209 64.8230 963.2394 439.4756 151.5619 150.280 66.1548 456.305 45.7948 980.440 68.665 77.504 67.100 72. develop a time series scatter diagram for the asset value (value of . For the data given and using a coded value for the quarter starting at unity for September 1988.6966 135.272 62.1256 432.0742 445.0405 450.5097 119.625 65.5122 412.0996 175.600 57.3886 152.738 71.3792 192.2047 177.640 48.1304 413. of shares 426.625 49.250 36.938 67.928 49.524 57.3459 441.4739 193.3983 952.618 43.5670 995.001 51.210 53.967 46.2736 430.233 69.750 64.0956 967.810 63.9882 465.3484 990.750 57.280 73.600 66.2219 985.323 48.3312 460. develop a line graph for the price per share.845 52.300 65.8518 120.079 39.2869 153.809 54.239 62.0912 467.254 72.500 53.8740 181.5265 End of month March 1990 June September December 1990 March 1991 June September December 1991 March 1992 June September December 1992 March 1993 June September December 1993 March 1994 June September December 1994 March 1995 June September December 1995 March 1996 June September December 1996 March 1997 June September December 1997 March 1998 June September Required 1.250 60.0627 173.625 50.6189 428.0383 191.261 44.2541 434.9852 134.0789 410.0847 452.2920 458.8302 416.500 52.317 40.9192 405.0003 472.5615 133.404 No. of shares 103.000 61.5043 118.580 84.4068 178.939 46.375 39.6194 470.971 45.5529 414.5411 148.235 50.824 49.1652 454.9798 448.

8. Compact discs Situation The table below gives the sales by year of music compact discs by a selection of Virgin record stores. What is the equation that represents the linear regression line? What information indicates quantitatively the accuracy of the asset value and time for this model? Would you say that the regression line could be used to reasonably forecast future values? From the linear regression equation. the portfolio) of the Hershey stock. what are the upper and lower values of assets at the end of December 2020? What occurrences or events could affect the accuracy of forecasting the value of Hershey’s asset value in 2020? Qualitatively. 2. 9. 11. Develop the linear regression equation that best describes this data.376 Statistics for Business 3. 7. 6. would you think there is great risk for Dan in finding that the value of his assets is significantly reduced when he retires? Justify your response. what is a forecast of the value of Dan’s assets in Hershey stock at this date? At a 95% confidence level. Plot the data on a scatter diagram. what is the annual average growth rate in dollars per year of the asset value of the portfolio? Dan plans to retire at the end of December in 2020 (4th quarter 2020). 4. Show on the scatter diagram graph the linear regression line for the asset value. Using the linear regression equation. 5. Year 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 CD sales (millions) 45 52 79 72 98 99 138 132 152 203 Required 1. Is the equation a good forecasting tool for CD record sales? What quantitative piece of data justifies your response? .

007 249.438 491.780 1. 6. what is the forecast for CD sales in 2007? From the linear regression function.700 21.537 16.901 332.811 98.528 589.861.264.665 498.765 447.493 26.031.510 25.866 32.Chapter 10: Forecasting and estimating from correlated data 377 3.189 477.579 55.167. www.380 1960 1961 1962 1963 1964 1965 1966 1967 1968 1969 1970 1971 1972 1973 1974 1975 1976 1977 1978 1979 1980 1981 1982 1983 1984 1985 1986 1987 1988 1989 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 7 US Census Bureau.684 1.907 176.7 (This is the same information presented in the Box Opener “Value of imported goods into the States” of this chapter.394 668.797 70.226.637 1.113 876.185 Year Imported goods ($millions) 124.377 1. What are your comments regarding using the linear and polynomial function to forecast compact disc sales? 12.681.750 265.228 151.) Table 1 Year Imported goods ($millions) 14. United States imports Situation The data in Table 1 is the amount of goods imported into the United States from 1960 until 2006. what is the forecast for CD sales in 2020? Does a second-degree polynomial regression line have a better fit for this data? Why? What would be the forecast for record sales calls in 2007 using the polynomial relationship developed in Question 5? 7.048 18.002 212.425 409.690 749.866 45.020 Year Imported goods ($millions) 536.499 103. .784 1.231 1.gov/foreign-trade/statistics/historical goods.991 35. From the linear regression function.148.067 247.758 14. 5.477. 4.260 17.307 1.642 268. Foreign Trade division.088 368.794 918.807 39.418 338.094 1. 8 June 2007. What would be the forecast for record sales calls in 2020 using the polynomial relationship developed in Question 5? 8.census.374 803.

Develop the exponential equation and the corresponding coefficient of determination for the complete data and show this information on the scatter diagram. 1975–1979.000 210.500 502. and polynomial equations developed in Questions 5. and 7 to forecast the value of imports to the United States for 2010.500 1.000 1. 9.500 9.000 37. What are your comments? 5.042.500 2004 16. 1995–1999.000 15. Also give the corresponding coefficient of determination: 1960–1964. 2002–2005.500 990. 3.000 1.500 75.500 785.500 127.200 2006 17. 10. 6. 1985–1989.500 1.124. Using the relationships developed in Question 2.400 682. Use the linear.500 862. Use the equation for the period. 7.500 540.100 52.600 978.000 697. Month January February March April May June July August September October November December 2003 15. English pubs Situation The data below gives the consumption of beer in litres at a certain pub on the river Thames in London. From the scatter diagram developed in Question 1 develop linear regression equations using just the following periods to develop the equation where x is the year.500 2005 16. 6.500 232.500 17. United Kingdom between 2003 and 2006 on a monthly basis.500 20.378 Statistics for Business Required 1.500 832.300 765.500 765.000 827.500 567. Develop the linear regression equation and the corresponding coefficient of determination for the complete data and show this information on the scatter diagram.000 172. Develop the fourth power polynomial equation and the corresponding coefficient of determination for the complete data and show this information on the scatter diagram.000 757.500 105.098.500 715. Discuss your observations and results for this exercise including the forecasts that you have developed. 1965–1969. 2002–2005. 13.000 97.500 8. exponential.198.000 7. 8.700 . 2.000 948.500 720.500 82. Develop a time series scatter data for the complete data. Compare these forecast values obtained in Question 3 with the actual value for 2006.200 45.900 47.000 569.000 22.000 622. developed in Question 3 to forecast United States imports for 2010. what would be the forecast values for 2006? 4.000 8.000 675.

Plot a graph of the centred moving average for the data. The table below gives the sales by quarter since 1997. 7. .415 14. What is your interpretation of this information for 2004? 4.658 13. Determine the ratio of the actual sales to the centred moving average for each quarter.358 13.785 11. Use the multiplication model. That is. United States is a distributor of garden tools.218 11.082 14.350 14.862 15.354 12. regression line.198 13.781 14.276 Year 2001 Quarter Winter Spring Summer Autumn Winter Spring Summer Autumn Winter Spring Summer Autumn Winter Spring Summer Autumn Sales 13.425 1998 2002 1999 2003 2000 2004 Required 1.177 13.002 14. and forecast.636 15.474 15.294 11.966 13.146 14. 2.948 11. deseasonalized sales.875 12. Develop a forecast by quarter for 2007.332 12.Chapter 10: Forecasting and estimating from correlated data 379 Required 1.184 14. What is the value of the coefficient of determination on the deseasonalized sales data? 6.325 14. Show graphically that the sales for Mersey are seasonal. predict sales by quarter for 2005.327 15.302 12.142 13. What are the seasonal indices for the four quarters using all the data? 5. Mersey Store Situation The Mersey Store in Arkansas.886 12. What is the linear regression equation that describes the centred moving average? 3. What would be an estimate of the annual consumption of beer in 2010? What are your comments about this forecast? 14. All data are in $ ’000s. What are your observations? 2. Develop a line graph on a quarterly basis for the data using coded values for the quarters. winter 2003 has a coded value of 1.251 15. Show graphically the moving average.665 13. Year 1997 Quarter Winter Spring Summer Autumn Winter Spring Summer Autumn Winter Spring Summer Autumn Winter Spring Summer Autumn Sales 11.584 13.

What are your observations? 2.150 5. and Martinique. 6. S1.245 6.325 7. That is. Develop a forecast by quarter for 2007. 2 July 2007.850 5. What is your interpretation of this information for 2005? 4.750 6. Case: Saint Lucia Situation Saint Lucia is an overseas territory of the United Kingdom with a population in 2007 of 171.523 4. sold per month during 8 Based on information from a Special Advertising Section of Fortune.725 5. What is the linear regression equation that describes the centred moving average? 3. Determine the ratio of the actual sales to the centred moving average for each quarter. winter 2003 has a coded value of 1. p. for a sports store in Redondo Beach.275 5. stunning nature trails.784 3.825 825 75 150 2005 150 450 2. Month January February March April May June July August September October November December 2003 150 375 1.575 8. in units per month.625 7. Plot a graph of the centred moving average for the data.975 7.8 With increased tourism goes the demand for hotel and restaurants. Related to these two hospitality institutions is the volume of wine in thousand litres.200 7.050 225 175 Required 1. Saint Vincent.650 6.900 3.275 4.486 5.000.175 5.625 6. Why are unit sales as presented preferable to sales on a dollar basis? 16.325 1. The Grenadines. and relaxing spas. United States of America during the period 2003 through 2006. Southern California. It is an island of 616 square miles and counts as its neighbours Barbados.985 5.980 6. It is an island with a growing tourist industry and offers the attraction of long sandy beaches.225 750 150 75 2004 75 450 1. Develop a line graph on a quarterly basis for the data using coded values for the quarters.400 5.025 5. What are the seasonal indices for the four quarters using all the data? 5. Swimwear Situation The following table gives the sale of swimwear.100 6.380 Statistics for Business 15. . superb diving in deep blue waters.650 975 150 85 2006 75 525 2.

300 26.800 28.Chapter 10: Forecasting and estimating from correlated data 381 2005.000 26.000 30. Table 1 Month Unit wine sales (1.900 2007 30.000 23. .900 25.550 23. 2006.100 27.000 28. and 2007.100 25.000 26.000 31.700 23. This information is in Table 2.000 31.400 27.500 32.000 25.100 22.000 Tourist bookings 2006 29.500 21.200 28.100 22.000 25.000 22. In addition.000 27.000 litres) 2005 January February March April May June July August September October November December 530 436 522 448 422 499 478 400 444 486 437 501 2006 535 477 530 482 498 563 488 428 430 486 502 547 2007 578 507 562 533 516 580 537 440 511 480 499 542 Table 2 Month 2005 January February March April May June July August September October November December 28.600 27.000 20. the local tourist bureau published data on the number of tourists visiting Santa Lucia for the same period.200 29. This data is given in Table 1.200 24.300 25.000 31.200 Required Use the data for forecasting purposes and develop and test an appropriate model.000 28.800 25.000 31.

This page intentionally left blank .

and platinum by 21%. nickel. gold by 32%.1 Indexing is another way to present statistical data and this is the subject of this chapter. Aluminium. 6 May 2006. .Indexing as a method for data analysis 11 Metal prices Metal prices continued to soar in early 2006 as illustrated in Figure 11. p. 105. lead. The price of silver has risen by some 65%. The Economist. and zinc are included in The Economist metals index curve and here the price of copper has increased by 60% and nickel by 45%.1. which gives the index value for various metals for the first half of 2006 based on an index of 100 at the beginning of the year. 1 Metal prices. copper. economic and financial indicators.

1 Metal prices. 180 170 160 150 140 130 120 110 100 90 06 06 6 06 Index 00 20 y2 20 20 ch ar Ja nu a Ap 1M 1 1 Fe b Silver Economist metals index 1 Gold Platinum 1M ru ar ay 2 ry ril 00 6 .384 Statistics for Business Figure 11.

we might want to know how prices have changed each year or how the productivity of a manufacturing operation has increased over time. the enrolment for the MBA programme is 125 candidates. The subjects treated are as follows: ✔ ✔ ✔ Relative time-based indexes • Quantity index number with a fixed base • Price index number with a fixed base • Rolling index number with a moving base • Changing the index base • Comparing index numbers • Consumer price index (CPI) and the value of goods and services • Time series deflation.16 0. The 3rd column gives the ratio of a particular year to the base value. There may be situations when we are more interested not in the absolute values of information but how data compare with other values.32.1. For example.07 0. Table 11. Year 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 Enrolment 95 97 110 56 64 125 102 54 62 70 Ratio to base value 1. Relative regional indexes (RRIs) • Selecting the base value • Illustration by comparing the cost of labour. Here the data for 1995 is considered the index base value. In Chapter 10. In their simplest form they measure the relative change in time respective to a given base value. The index number is the ratio of a certain value to a base value usually multiplied by 100. time period in years and the 2nd column is the absolute values of enrolment in an MBA programme for a certain business school over the last 10 years from 1995.00 1. The 4th column is the ratio for each year multiplied by 100.1 Enrolment in an MBA programme.67 1. we introduced bivariate time-series data showing how past data can be used to forecast or estimate future conditions. For these situations we use an index number or index value. Weighting the index number • Unweighted index number • Laspeyres weighted price index • Paasche weighted price index • Average quantity-weighted price index. The 1st column is the .Chapter 11: Indexing as a method for data analysis 385 Learning objectives After studying this chapter you will learn how to present and analyse statistical data using index values.65 0. This is the index number. This gives a ratio to the 1995 data of 125/95 or 1.74 Index number 100 102 116 59 67 132 107 57 65 74 Quantity index number with a fixed base As an example of a quantity index consider the information in Table 11. When the base value equals 100 then the measured values are a percentage of the base value as illustrated in the box opener “Metal prices”.57 0.02 1. The index number for the base period is 100 and this is obtained by the ratio (95/95) * 100.32 1. Relative Time-Based Indexes Perhaps the most common indices are quantity and price indexes. If we consider the year 2000.59 0.

8910 2.5015 0.8820 $/litre 0.7850 litres.22 1. . gov/cgi-bin/surveymost.27 1.4972 Ratio to base value 1.0290 2.5392 0. The most common price index is the consumer price index. and Qn is the quantity at another period. or alternatively an increase of 32%.) In this table.4206 0.2 which gives the average price of unleaded regular petrol in the United States for the 12-month period from January 2004.5920 1.8980 1.5308 0. 2 US Department of Labor Statistics.7660 1. and Pn is the price at another period.5310 0.26 1. called the relative price index is. $/gallon 1.05 1. IQ. The general equation for this index. that is used as Here P0 is the price at the base period.15 1. We can interpret this information by saying that enrolment in 2000 is 132% of the enrolment in 1995.4666 0.5123 0. IP.4843 0.32 * 100 132. This other period might be at a future date or after the base period. http://data.26 1. IQ Qn * 100 Q0 11(i) Here Q0 is the quantity at the base period.0410 1.19 1.0090 2. which is called the relative quantity index.4996 0. In 2004 the enrolment is only 74% of the 1995 enrolment or 26% less (100% 74%). we can see that the price of gasoline has increased 28% in the month of June compared to the base month of January.18 Index number 100 105 111 115 126 128 122 119 119 127 126 118 Thus. the general equation for this index.2 Month January February March April May June July August September October November December Average price of unleaded gasoline in the United States in 2004. the index for 2000 is 1. In the European Union the organization concerned is Eurostat.bls. which compares the level of prices from one period to another.19 1.00 1.8330 2.9390 1. calculated in a similar way to the quantity index. is the price index. The data is collected and compiled by government agencies such as Bureau of Labour Statistics in the United Kingdom and a similar department in the United States. In a similar manner to the quantity index.2 (For comparison the price is also given $ per litre where 1 gallon equals 3. it could be a past period or before the base period. a measure of inflation by comparing the general price level for specific goods and services in the economy.11 1. IP Pn * 100 P0 11(ii) Price index number with a fixed base Another common index.5361 0.28 1. Alternatively.4417 0. is.386 Statistics for Business Table 11. Consider Table 11.0100 1.6720 1.

3 Rolling index number of MBA enrolment. The 3rd column gives the number of recorded automobile accidents but in this case the base period of 100 is for the year Changing the index base When the base point of data is too far in the past the index values may be getting too high to be meaningful and so we may want to use a more . In 2002 the index compared to 2001 is calculated by (54/102) * 100 53. consider Table 11. The index values for the other years are determined in the same manner. By transposing the data in this manner we have brought our index information closer to our current year.1.3 which is the same enrolment MBA data from Table 11. The 3rd column shows the index on a basis of 1995 equal to 100.4 Year 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 Retail sales index. Comparing index numbers Another interest that we might have is to compare index data to see if there is a relationship between one index number and another. rather than how it changes according to a fixed base.8160 0. This means that in 1999 there was a 14% increase in student enrolment compared to 1998. For example. In the last column we have an index showing the change relative to the previous year. is (295/295) * 100 100. Year Enrolment Ratio to immediate previous period 1. for the number of driving licences issued.1429 1. The index value for 1995.1481 1.Chapter 11: Indexing as a method for data analysis 387 Table 11. In this case.5091 1.5294 1. This means that enrolment is down 47% (100 53) in 2002 compared to 2001. we would use a rolling index number. As an illustration.0211 1. The index value for 1998 is (322/295) * 100 109.1340 0. recent index so that our base point corresponds more to current periods. gives information relative to a base period of 1960 equal to 100. for example. For example.4 where the 2nd column shows the relative sales for a retail store based on an index of 100 in 1980.1290 Annual change Rolling index 102 113 51 114 195 82 53 115 113 Table 11.5 which is index data for the number of new driving licences issued and the number of recorded automobile accidents in a certain community.9531 0. the previous year. Sales index 1980 100 295 286 301 322 329 345 352 362 359 395 Sales index 1995 100 100 97 102 109 112 117 119 123 122 134 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 95 97 110 56 64 125 102 54 62 70 Rolling index number with a moving base We may be more interested to know how data changes periodically. The 2nd column. Consider Table 11. Again the value of the index has been rounded to the nearest whole number. the rolling index for 1999 is given by (64/56) * 100 114. consider Table 11.

statistics.8]. in the period 2000–2004. in 2000 the index is (406/ 406) * 100 100.3 For this period the CPI has increased by 9. indirect taxes. It is determined by measuring the value of a “basket” of goods in one base period and then comparing the value of the same basket of goods at a later period. For example. This could have been perhaps because of better police surveillance. if you measure your salary increase to the CPI of 9. For this period the CPI has increased by only 0.gov. In this case.388 Statistics for Business Figure 11.) Say now.000 50. Then for example.6 gives the CPI in the United Kingdom for 1990 for all items.000)/50. Table 11.000 and then at the end of 1990 it was increased to £54. the index in 1995 is (307/406) * 100 76 and in 2004 the index is (469/406) * 100 116. consumer goods. the indices are rounded to the nearest whole number. Your salary has increased by an amount of 8% [(£54.000. Table 11. the increase was not as pronounced going from 100 to 112 compared to the number of licenses issued going from 100 to 116.34%. Year Driving Automobile Driving licenses accidents licenses issued 2000 100 issued 1960 100 2000 100 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 307 325 335 376 411 406 413 421 459 469 62 71 79 83 98 100 105 108 110 112 76 80 83 93 101 100 102 104 113 116 CPI and the value of goods and services The CPI is a measure of how prices have changed over time. You have less spending power than you did at the end of 1989 and would not unreasonably be dissatisfied. in order to determine the annual change for 1990. .34% the “real” value or “worth” of your salary has in fact gone down.uk (data. etc. we can see that there appears to be a relationship between the number of new driving licenses issued and the recorded automobile accidents. This basket of goods can include all items such as food. Alternatively. or other reasons. housing costs. The change is most often presented on a ratio measurement scale. a better road infrastructure. However. mortgage interest payments. Now that we have the indices on a common base it is easier to compare the data. More specifically in the period 1995–2000. the CPI can be determined by excluding some of these items.7 which is the CPI in the United Kingdom for 2001 for all items. As an illustration. for example.8)/118. 13 July 2005).5 Automobile accidents and driving licenses issued. Consider now Table 11. [(129. the index for automobile accidents went from 62 to 100 or a more rapid increase than for the issue of driving licences which went from 76 to 100. In both cases.000] and your manager might expect you to be satisfied.2 gives a graph of the data where we can see clearly the changes. your annual salary at the end of 1989 was £50. (Note that we have included the CPI for December 1989. It is inappropriate to compare data of different base periods and what we have done is converted the number of driving licences issued to a base period of the year 2000 equal to 100. where we determined if the change in one variable was caused by the change in another variable. Comparing index numbers has a similarity to causal regression analysis presented in Chapter 10. When there is a significant increase in the CPI then this indicates an inflationary period.70% 3 2000.9 118. However. http://www.

1 126.7 126.Chapter 11: Indexing as a method for data analysis 389 Figure 11.4 125. 2001.4 December 1989 January 1990 February March April May June July August September October November December 1990 December 2000 January 2001 February March April May June July August September October November December 2001 .5 120.2 173.8 119.7 Month Consumer price index.6 174.2 126. 120 110 Index number: 2000 100 100 90 80 70 60 50 1994 1995 1996 1997 1998 1999 2000 Year 2001 2002 2003 2004 2005 Automobile accidents Issued automobile licences Table 11.2 121. 1990.9 Table 11.0 172.6 Month Consumer price index.0 129.0 174.6 173.3 173.2 171.3 174.4 173.1 174.2 Automobile accidents and driving licences issued.3 130.1 172. Index 172.8 128.1 129.3 130. Index 118.2 174.

54. Time series deflation is illustrated as follows using first the information from Table 11. I-values. the salary index to the base period is.9 98. 000 * 100 50.25 This means to say that the real value of the salary has increased by 7.77). 000 100 Ratio of the CPI at the base period to the new period is 172. This person should be satisfied as compared to the CPI increase of 0.2 173. 000 129. the base salary index is. 54. 108 * 0. over the same period.9145 98. n.8 * * 100 50.4 0. At the end of 2001. if you have a time series.23% (100. Say now a person’s annual salary at the end of 2000 was £50. 000 173.6: Base value of the salary at the end of 1989 is £50. the salary index to the base period is. if we substitute in equation 11(iii) the salary and CPI information for 2000 we have the following: RVI 54.25 This means a real increase of 7.000]. In summary.9145 RVI xn I 0 * * 100 x0 I n Multiply the salary index in 1990 by the CPI ratio to give the real value index (RVI) or.77 This means to say that the real value of the salary has in fact declined by 1.000 50.8 129. the base salary index is. 000 * 100 50.25%.23%.2 * * 100 50.000/year This means a real decrease of 1.77 If we substitute in equation 11(iii) the salary and CPI information for 1990 we have the following: RVI 54.9931 107. 000 * 100 50.4 172.9 0. in this case salary from the previous section. 000 100 At the end of 2001. .25%. 108 * 0. x-values of a commodity and an index series. 118.000/year At the end of 1989.00 98.9931 Multiply the salary index in 2001 by the CPI ratio to give the RVI or. 000 118.70% there has been a real increase in the salary and thus in the spending power of the individual. The salary increase is 8% as before [(£54. If we do the same calculation using the CPI for 2001 using Table 11.000)/50.2].7 then we have the following: Base value of the salary at the end of 2000 is £50. Similarly. 50. 000 108 Ratio of the CPI at the base period to the new period is.2)/172. 50.390 Statistics for Business [(173. 000 * 100 50. then the RVI of a commodity for this period is.4 107.000. RVI Current value of commodity Base value of commodity * Base indicator * 100 Current indicator 11(iii) At the end of 1990. we can use time series deflation. 000 172. 000 108 Time series deflation In order to determine the real value in the change of a commodity.000 and then at the end of 2001 it was £54.

United States Index. we first decide what our base point is for Table 11. Again. France 100 100 100 Australia Belgium Britain Canada Czech Republic France Greece Ireland Japan Luxembourg New Zealand Poland Portugal Slovakia South Korea Spain United States 139 121 130 109 100 164 155 148 97 152 127 106 152 133 76 112 100 107 93 100 84 77 126 119 114 74 116 98 81 116 102 58 86 77 85 74 80 67 61 100 94 91 59 93 78 65 93 81 46 69 61 . we multiply the ratio by 100 so that the calculated index values represent a percentage change.8 are data on the cost of labour in various countries in terms of the statutory The cost of labour.8 Country Illustration by comparing the cost of labour In Table 11. One of the criteria for selection was the cost of labour in various selected European countries compared to the United States.Chapter 11: Indexing as a method for data analysis Notice that the commodity ratio and the indicator ratio are in the reverse order since we are deflating the value of the commodity according to the increase in the consumer price. we might be interested to compare the cost of living in London to that of New York. As an illustration. Britain Index. from this base value. Selecting the base value When we use indexes to compare regions to others. Paris. Minimum wage plus social security contributions as percent of labour cost of average worker (%) 46 40 43 36 33 54 51 49 32 50 42 35 50 44 25 37 33 Index. This type of comparison is illustrated in the following example. when I was an engineer in Los Angeles our firm was looking to open a design office in Europe. comparison and then develop the relative regional index (RRI). and Los Angeles or the productivity of one production site to others. For example. When we use indexes in this manner the time variable is not included. Tokyo. Relative regional index Value at other region Value at base region V0 * 100 * 100 Vb 391 Relative Regional Indexes Index numbers may be used to compare data between one region and another.

we have converted the labour cost value into an index using the United States as the base value of 100. and 30% more in Britain than in the United States.00 0.9 Eleven products purchased in a hypermarket. IP Pn * 100 P0 96.10 2.50 6. the index for Australia is 139. Unweighted index number Consider the information in Table 11. Item and unit size (weight.9 to give the Table 11. or unit) Bread. for example. 16% more in Portugal and 16% less in Canada.50 5. 88. index number means that each item in arriving at the index value is considered of equal importance. We interpret this data in Column 4. litre Fish.93 22. kg Chicken.35 4.35 3. relocate to lower cost regions.07 To the nearest whole number this is 129.50 1. Column 4 of Table 11.16 * 100 74. for example.60 27. For example. particularly manufacturing. 24% less in South Korea than in the United States (100% 76%). P0 (£/unit) 1.4 In Column 3.95 74. 200 g Cheese.45 5.60 20. [(43%/33%) * 100].392 Statistics for Business minimum wage plus the mandatory social security contributions as a percentage of the labour costs of the average worker in that country. In the weighted index number. 2 April 2005. a laptop computer is added to Table 11.50 2005. The Economist. 75 cl Instant coffee. for example.50 0. volume. litre Total 2000. .10 3.16 Weighting the Index Number Index numbers may be unweighted or weighted according to certain criteria. the cost of labour in Australia is 15% less than in France. The unweighted 4 Economic and financial indicators.50 1. assume that an additional item.70 96. Here. kg Cereals.8 gives comparisons using Britain as the base country such that the base value for Britain is 100 [(43%/43%) * 100].20 17.9 that gives the price of a certain 11 products bought in a hypermarket in £UK for the years 2000 and 2005.18 1.90 22. kg Milk. which indicates that in using the items given. each Apples. If we use equation 11(ii) then the price index is. [(46%/33%) * 100] for South Korea it is 76. Now. the labour cost in Australia is 7% more. In fact from Column 5 we see that France is the most expensive country in terms of the cost of labour and this in part explains why labour intensive industries. in Britain it is 20% less.00 0. and in South Korea it is a whopping 54% less than in France. The base values of the other countries are then determined by the ratio of that country’s value to that of the United States. loaf Wine. Pn (£/unit) 1. kg Petrol.70 18. We interpret this index data by saying that the cost of labour in Australia is 39% more than in the United States. by saying that compared to Britain. emphasis or weighting is put onto factors such as quantity or expenditure in order to calculate the index. 750 g Lettuce. p. [(25%/33%) * 100] and for Britain it is 130. prices rose 29% in the period 2000 to 2005. This is determined by the calculation (33%/33%) * 100.50 4.50 129. Column 5 gives similar index information using France as the base country with an index of 100 [(54%/54%) * 100)].

925. In a similar manner we can use equation 11(i) to calculate an unweighted quantity index.50 * 100 7.60 27.51 Laspeyres weighted price index The Laspeyres weighted price index. 466.9 with the addition that here the quantities consumed in the base period 2000 are also indicated. each Apples. after its originator.350. Item and unit size (weight.00 0. Thus. Again using equation 11(ii) the price index is. P0 (£/unit) 1.11 gives the calculation procedure for the Laspeyres price index for the items in Table 11.10 2.50 5. Laspeyres price index in 2000 is.17 for the 6-year period between 2000 and 2005 Thus. For example. litre Laptop computer TOTAL 2000. or unit) Bread. The concept of weighting or putting importance on items of data was first introduced in Chapter 2.50 1. litre Fish.93 22.51 * 100 2. from equation 11(iv). IP Pn * 100 P0 1. 447. Pn (£/unit) 1.50 6. kg Milk. Q0 is the quantity consumed in the base period. a family may purchase Table 11. In determining these price indexes using equation 11(ii).00 2. kg Cereals. 466.Chapter 11: Indexing as a method for data analysis revised Table 11.00 or 100 . kg Chicken.00 0.10 Twelve products purchased in a hypermarket.0. In addition.20 17.50 1.10 3. This is a major disadvantage of an unweighted index as it neither attaches importance or weight to the quantity of each of the goods purchased nor to price changes of high volume purchased items and to low volume purchased items.35 4. 750 g Lettuce.70 1. 200 g Cheese. the quantities in the base period. to be more meaningful we should use a weighted price index. loaf Wine.60 2005. the value of the denominator ∑P0Q0 remains constant for each index and this makes comparison of successive indexes simpler where the index for the first period is 100. volume.925. Note that with this method. 393 49.50 4.10.447.45 5.18 1. ∑ Pn Q0 * 100 ∑ P0Q0 7.50 100.48 or an index of 49 This indicates that prices have declined by 51% (100 49) in the period 2000 to 2005.850. is determined by the following relationship: Laspeyres weighted price index ∑ Pn Q0 ∑ P0Q0 Here.90 22.60 200 lettuces/year but probably would only purchase a laptop computer say every 5 years. Here we have assumed that the quantity of laptop computers consumed is 1⁄ 6 or 0. The Table 11.95 2. Q0 are used in both the numerator and the denominator of the equation. We know intuitively that this is not the case. we have used an unweighted aggregate index meaning that in the calculation each item in the index is of equal importance. kg Petrol. P0 is the price in the base period. 75 cl Instant coffee.50 0.60 20.35 3. ● ● ● * 100 11(iv) Pn is the price in the current period.70 18.00 1.

00 210.986.35 4.00 2. For example.394 Statistics for Business Table 11. The Paasche equation is.50 202.00 * 100 7.00 0. the value of the denominator ∑P0Qn changes according to the period with the value of Qn. Thus.00 90.60 20. 986.050.70 18. volume.550.50 0.18 1.17 2. or unit) Bread. litre Fish. and since we are using the quantities for the base year.350.00 Laspeyres price index in 2000 is.60 2005.50 5.20 17.93 22.00 0. ● ● ● ∑ Pn Q0 * 100 ∑ P0Q0 9. P0 (£/unit) 1.60 27.00 1.50 1. kg Chicken.00 475.00 345.00 1. Thus.51 Quantity (units) consumed in 2000. we can determine a new index for 2003. ∑ PnQn ∑ P0Qn * 100 11(v) Pn is the price in the current period.70 1.425.10 3.00 2. which has the same prices for the base period but . kg Milk.500 0.11 Laspeyres price index. loaf Wine. if we had prices in 2003 for the same items.490.50 1. 2000. Qn is the quantity consumed in the current period n.00 3.00 1.00 260.35 3.90 22. is calculated in a similar manner to the Laspeyres index except that now current quantities in period n are used rather than quantities in the base period.00 279.466. P0 is the price in the base period. 75 cl Instant coffee.10 2. Paasche price index Here.850. kg Cereals.50 6.76 or 134 rounding up. 200 g Cheese.00 65.45 5.350. The Paasche price index is illustrated by Table 11. Q0 150 120 50 60 25 100 25 120 300 40 1.00 9.00 112.00 7. This is the same as saying that in this period prices have increased by 34%. For example.00 225.447.17 P0*Q0 Pn*Q0 Item and unit size (weight.00 414.00 900. Pn (£/unit) 1. the Laspeyres weighted price index. A disadvantage with this method is that it does not take into account the change in consumption patterns from year to year. unlike.00 1. if we have selected a representative sample of goods we conclude that the price index for 2005 is 134 based on a 2000 index of 100. Paasche weighted price index The Paasche price index.50 4.00 129.50 540.12.50 135. we may purchase less of a particular item in 2005 than we purchased in 2000. 466.50 133.925.460. kg Petrol. litre Laptop computer TOTAL 165. each Apples. 750 g Lettuce. With the Laspeyres method we can compare index changes each year when we have the new prices. in the Paasche weighted price index.240.50 110.00 2.95 2. again after its originator.00 720.

These revised quantities show that perhaps the family is becoming more health conscious. loaf Wine.00 279.00 1. 891. litre Laptop computer TOTAL 82. 400.100. apples.00 45. 75 cl Instant coffee.00 225.00 10. Pn (£/unit) 1.70 1.400.350.850.45 5.90 22. 750 g Lettuce. 400.00 0.80 270.93 22.360. kg Petrol.00 2. wine.800. As we see from Tables 11.35 4.400.00 8.00 414. cheese.891. ∑ Pn Qn ∑ P0Qn * 100 10. Qn 75 80 60 20 10 200 50 200 300 80 800 0.05 the quantities are for the current consumption period.17 1.00 0. 400.00 130. each Apples. or unit) Bread.925.875.00 180.440.11 and 11. These fixed quantities can be the average quantities consumed within the time periods considered or some other appropriate fixed values.00 5. in that the consumption of bread. fish.10 3.51 Quantity consumed in 2005.00 1.00 450. kg Chicken.20 17. and chicken (white meat) is up. and petrol (family members walk) is down whereas the consumption of lettuce.50 0. Thus. coffee.60 27. using equation 11(v). P0 (£/unit) 1.50 5. volume.50 1.00 350.35 3.00 or 100 Paasche price index in 2005 is.25 360.17 P 0 * Qn Pn * Qn Item and unit size (weight. Thus. 200 g Cheese.50 129. with the Paasche index using revised consumption patterns it indicates that the prices have increased 30% in the period 2000 to 2005. ∑ Pn Qa ∑ P0Qa * 100 11(vi) .50 6.05 * 100 8.00 1.12 there were changes in consumption patterns so that we might say that the index does not fairly represent the period in question.50 4. Paasche price index in 2000 is.00 312.50 276.00 1.00 210.00 760.60 2005.50 * 100 8.447. kg Cereals. Average quantity-weighted price index In the Laspeyres method we used quantities consumed in early periods and in the Paasche method quantities consumed in later periods.00 475. litre Fish.70 18. An alternative approach to the Laspeyres and Paasches methods is to use fixed quantity values that are considered representative of the consumption patterns within the time periods considered.00 220. 2000.50 101.10 2.60 20.65 or 130 rounding up. In this case.18 1.Chapter 11: Indexing as a method for data analysis 395 Table 11.00 51.50 1.50 100. we have an average quantity weighted price index as follows: Average quantity-weighted price index ∑ Pn Qn ∑ P0Qn * 100 8.00 4.12 Paasche price index.95 2. kg Milk.

litre Fish.00 10.50 131.88 450.17 P0*Qa Pn*Qa Item and unit size (weight.320. The usefulness of this index is that we have the flexibility to choose the base price P0 and the fixed weight Qa.150.00 97. 75 cl Instant coffee. 438.13.447.93 22. ● ● ● Average quantity weighted price index in 2005 is.50 150. kg Chicken.396 Statistics for Business Here.00 40. volume. Pn is the price in the current period. From equation 11(vi) using this information.18 1. litre Laptop computer TOTAL 123.53 .00 1. Average quantity weighted price index in 2000 is. This average quantity consumed is in fact a fixed quantity and so this approach is sometimes referred to as a fixed weight aggregate price index.092.90 22. 933. Here we have used an average weight but this fixed quantity can be some other value that we consider more appropriate.17 1.50 900.875.10 3. P0 (£/unit) 1.50 1.00 60.70 18.50 135.00 1.00 1. 750 g Lettuce.00 1.00 300.00 2. P0 is the price in the base period.00 379.53 * 100 7. ∑ Pn Qa ∑ P0Qa * 100 10.00 78.70 1. Qa 112.10 2. each Apples.080.13 Average price index 2000. kg Cereals.00 90.65 202.60 27.50 151. kg Petrol.75 345.00 4.50 4.350.50 100.850.955.50 6.350.60 20.50 * 100 7.60 2005.35 4.933.50 100.75 165. 933.95 2.00 55.00 700.00 279.280. The new data is given in Table 11. kg Milk.51 Average quantity consumed between 2000 and 2005.438.00 0.00 37.00 0.50 3.50 160. Table 11.00 17.20 17.45 5. 933.00 1. Pn (£/unit) 1.00 225.00 0.00 286. or unit) Bread. 200 g Cheese.00 7.5 0.925.58 or 132. loaf Wine.35 3.00 1. ∑ PnQa ∑ P0Qa * 100 7.50 1.00 210. Qa is the average quantity consumed in the total period in consideration.50 5.50 475.00 or 100 Rounding up this indicates that prices have increased 32% in the period.

the index is fairer and more representative of consumption patterns in the period. There can be many RRIs depending on the values that we wish to compare. or as illustrated in the chapter. A weighted price index is where different weights are put onto the index to indicate their importance in calculating the index. In this case. we convert the historical sales index to 100 by dividing this value by itself and multiplying by 100. and dividing the total value by the sum of the product of the price in the base period and the consumption in the base period. This is analogous to causal regression analysis where we establish whether the change in one variable is caused by the change in another variable. The base value is converted to 100 so that the relative values show a percentage change. the cost of labour compared to France. the price of housing in major cities compared to say London.Chapter 11: Indexing as a method for data analysis 397 This chapter has introduced relative time-based indexes. Some RRIs might be the cost of living in other locations compared to say New York. The new relative index values are then the old values divided by the historical index value. which is the ratio of total product of current consumption and current price. Rather than having a fixed base we can have rolling index where the base value is the previous period so that the change we measure is relative to the previous period. Relative regional indexes The goal of relative regional indexes (RRIs) is to compare the data values at one region to that of a base region. Relative index values can be compared to others to see if there is a relationship between one index and another. An alternative to the Laspeyres index is the Paasche weighted price index. An alternative to both the Laspeyres and Paasche index is to use an average of the quantity consumed during the period considered. . RRIs. Chapter Summary Relative time-based indexes The most common relative time-based indexes are the quantity and price index. Weighting the index An unweighted index is one where each element used to calculate the index is considered to have equal value. When the index base is too far in the past the index values may become too high to be meaningful. In their most common form these indexes measure the relative change over time respective to a given fixed base value. To do this we use a time series deflation which determines the real value in the change of a commodity. In this way. A useful comparison of indexes is to compare the index of wage or salary changes to see if they are in line with the change in the CPI. An often used price index is the CPI which indicates the change in prices over time and thus is a relative measure of inflation. and weighted indexes as a way to present and analyse statistical data. divided by the total product of current consumption and base price. This is how we would record annual or monthly changes. A criticism of this index is that if the time period is long it does not take into account changing consumption patterns. The Laspeyres price index is where the index is weighted by multiplying the price in the current period. by the quantity of that item consumed in the base period.

645 9.142 Year Backlog ($billions) 10.900 1988 1989 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 Required 1. Develop the quantity index numbers for this data where 1988 has an index value of 100.558 11.500 9. based on 2000.710 10. Develop the quantity index for this data where the year 2000 has an index value of 100.725 15.398 Statistics for Business EXERCISE PROBLEMS 1. 4. . 2000. 2.607 14. 1998. In the following table are the backlog revenues of the firm in billions of dollar since 1988. chemical plants.361 9. how would you describe the backlog of the firm. Normally. and 2004? 5 Fluor Corporation Annual reports. the volume of work is calculated in terms of labour hours and material costs and this is then converted into estimated revenues. Backlog Situation Fluor is a California-based engineering and constructing company that designs and builds power plants.659 8. and 2005? 5. oil refineries. and other processing facilities. and 2005? 3.706 14. How would you describe the backlog of the firm.022 14. The backlog represents the amount of work that will be completed in the future. based on 1988.766 14. Why is an index number based on 2000 preferred to an index number of 1988? 6. Using the rolling quantity index. How would you describe the backlog of the firm. Develop a rolling quantity index from 1988 based on the change from the previous period.000 11. 1993. 1994.800 14. 7. in 1989. in 1990.181 14.754 Year Backlog ($billions) 14. Year Backlog ($billions) 6. in 1989.400 12.5 Backlog is the amount of work that the company has contracted but which has not yet been executed.

doe.gov (consulted July 2006). in 1996. Why is an index number based on 1996 preferred to an index number of 1987? 6. 2001. How would you describe gold prices. which year saw the biggest annual increase in the price of gold? 3. http://www. based on 1996. Gold Situation The following table gives average spot prices of gold in London since 1987. 7. US Department of Energy. Develop the price index numbers for this data where 1987 has an index value of 100. and 2005? 3.Chapter 11: Indexing as a method for data analysis 399 2. In 1971 President Nixon allowed the $US to float by eliminating its convertibility into gold. Year Gold price ($/ounce) 446 437 381 384 362 344 360 384 384 388 Year Gold price ($/ounce) 331 294 279 279 271 310 364 410 517 1987 1988 1989 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 Required 1. . 2. Develop the price index numbers for this data where the year 1996 has an index value of 100. Concerns over the economy and scarcity of natural resources resulted in the gold price reaching $850/ounce in 1980 which coincided with peaking inflation rates. which year saw the biggest annual decline in the price of gold? 8. United States gasoline prices Situation The following table gives the mid-year price of regular gasoline in the United States in cents/gallon since 19907 and the average crude oil price for the same year in $/bbl. based on 1987.8 6 7 Newmont. and 2005? 5.6 In 1969 the price of gold was some $50/ounce. The price of gold bottomed out in 2001.wtrg. Develop a rolling price index from 1987 based on the change from the previous period. 2005 Annual Report. 4. How would you describe gold prices. Using the rolling price index. in 1987.com/oil (consulted July 2006). 8 http://www. 2001. Using the rolling price index.

Why might an index number based on 2000 be preferred to an index number of 1990? 6. 10. Develop a rolling price index from 1990 based on the change from the previous period. in 1993. How would you describe gasoline prices based on 1990.50 169.40 112. .20 116. and 2005? 5.10 121. Using the rolling price index. 1998.10 120. Develop the price index numbers for this data where 2000 has an index value of 100.70 136.10 112.9 9 International Coffee Organization. based on 2000. What are your comments related to the graphs you developed in Question 9? 4. which year saw the biggest annual increase in the price of regular gasoline? 8.10 106.80 100. 1998. and 2005? 3.400 Statistics for Business Year Price of regular grade gasoline (cents/US gallon) 119. http://www.00 134.ico.40 121.30 185. 4. Coffee prices Situation The following table gives the imported price of coffee into the United Kingdom since 1975 in United States cents/pound.org (consulted July 2006). Develop the price index for regular grade gasoline where 1990 has an index value of 100.90 292. 7. Develop the price index for crude oil prices where 1990 has an index value of 100. Plot the index values of the gasoline prices developed in Question 1 to the crude oil index values developed in Question 8.10 112.80 Oil price ($/bbl) 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 20 38 20 19 18 19 20 22 19 12 15 30 25 25 27 35 62 Required 1. 9.40 251.20 142. 2. How would you describe gasoline prices. in 1993.

Develop the price index for the imported coffee prices where 1975 has an index value of 100.09 1.65 1.51 979.94 Required 1.39 1.29 699.83 1.103.45 730. 1995.54 923.11 809.30 1.339.027. 10 The Boeing Company 2005 Annual Report.08 1. 2.181. which year and by what amount was the biggest annual increase in the price of imported coffee? 10.84 734.13 1.374. How would you describe coffee prices based on 1990. and 2004? 7.066.58 1. How would you describe coffee prices based on 2000. Boeing Situation The following table gives summary financial and operating data for the United States Aircraft Company Boeing.477.273.61 Year 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 US cents/1b 1. 4.421. in 1985.233. 6. and 2004? 5.80 872. was the annual decrease in the price of imported coffee? 11.84 817.009.567. How would you describe coffee prices based on 1975. and 2004? 3. Using the rolling price index. in 1985. 1995.49 1. Using the rolling price index.65 1.46 965.530. . Develop a rolling price index from 1975 based on the change from the previous period.52 1.47 1.119.273.51 1.10 All the data is in $US millions except for the earnings per share.17 455.102.340.30 804.9 1. 1995. which year and by what amount. Why are coffee prices not a good measure of the change in the cost of living? 5. Develop the price index for the imported coffee prices where 1990 has an index value of 100.Chapter 11: Indexing as a method for data analysis 401 Year 1975 1976 1977 1978 1979 1980 1981 1982 1983 1984 1985 1986 1987 1988 1989 US cents/1b 329. Develop the price index for the imported coffee prices where 2000 has an index value of 100.21 1.011. 9. in 1985.10 1.55 1. Which index base do you think is the most appropriate? 8.

017 135. Develop the index numbers for revenues using 2005 as the base.40 0.845 2.90 6.94 10. Ford Motor Company Situation The following table gives selected financial data for the Ford Motor Company since 1992.80 0. high ($/share) 8.47 1.457 1.693 4.95 15.65 1.85 0. 3.308 4. .425 138.407 91.61 7.42 18. how would you describe the progression of revenues? 6.05 0.777 130.91 1.40 0.64 25.24 3.92 12.827 3.222 4.59 18.40 104.40 6.00 13.88 1.33 17.84 6.46 31.256 718 0.402 Statistics for Business 2005 Revenues Net earnings Earnings/share Operating margins (%) Backlog 54.029 140.253 147.137 110.80 104.831 492 2.23 1.75 Stock price.78 12.432 4.139 4.933 4.131 4. Use the index values developed in Question 5.80 1.370 4.600 2003 50.591 4.30 31.402 4.920 22.292 4.20 106.467 5.07 7.024 Stock price.446 6.69 14.503 Net income total company ($millions) 7.80 109.237 3.279 4.34 33.020 3. 2002 and 2005.03 9. 4.835 2.976 118.071 7. 6.529 5.572 3.915 1992 1993 1994 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 0. low ($/share) 5.76 37. How would you describe the revenues for 2001 using the base developed in Question 1? Develop the index numbers for earnings/share using 2001 as the base? How would you describe the earnings/share for 2005 using the base developed in Question 3? 5.10 160. 2.173 2001 57.42 21.487 2.568 107.787 4.72 1.473 2004 52.453 980 495 3.827 134.23 17.57 Dividends ($/share) Vehicle sales North America units 000s 3. Develop a rolling index for revenues since 2001.40 11 Ford Motor Company Annual Reports.06 12.496 116.812 2002 53.34 14.85 9.11 Year Revenues automotive ($millions) 84.128 153.591 Required 1.872 2.40 0.44 9.970 2.70 6.886 121.19 5.80 0.58 12.

Develop the index numbers for revenues using 1992 as the base. 4. how would you describe the situation of the Ford Motor Company? 7. Using France as the base. Develop the index numbers for North American vehicle sales using 1992 as the base. From the information given.00 3. and which was the worst? 7. and Germany? 3.00 7. how would you describe the percentage of 15.and 16-year olds who admitted to being drunk 3 times or more in a 30-day period in 2003.00 16.00 Required 1. Drinking Situation In Europe. in which years did the revenues decline. alcohol consumption rates are rising among the young.00 9.and 16-year olds who admitted to being drunk 3 times or more in a 30-day period in Ireland. Using the index for Britain developed in Question 1.12 Country Britain Denmark Finland France Germany Greece Ireland Italy Portugal Sweden Percentage 23. how would you describe the percentage of 15. and Germany? 12 Europe at tipping point.Chapter 11: Indexing as a method for data analysis 403 Required 1. International Herald Tribune. How would you describe the revenues for 2005 using the base developed in Question 1? Develop the rolling index for revenues starting from 1992.00 3.00 26. 1 and 4. Greece. Using Britain as the base. . Using the rolling index based on the previous period.00 3. The following table gives the percentage of 15. 3.00 26. 2. develop a relative regional index for the percentage of 15. Based on the index numbers developed in Question 5 which was the best comparative year for vehicle sales. 2. 4. 26 June 2006. develop a relative regional index for the percentage of 15.00 10.and 16-year olds who admitted to being drunk 3 times or more in a 30-day period in Ireland.and 16-year olds who admitted to being drunk 3 times or more in a 30-day period. and from the data that you have developed. and by how much? 5. Using the index for France developed in Question 3.and 16-year olds who admitted to being drunk 3 times or more in a 30-day period. 6. p. Greece.

and 16-year olds who admitted to being drunk 3 times or more in a 30-day period in Ireland.00 14. Using the index for the United States developed in Question 1.00 36.80 24. Using the index for Denmark developed in Question 3.00 6. how would you describe the percentage of 15.00 15.40 Australia Austria Belgium Britain Canada Denmark Finland France Germany Greece Ireland Italy Japan The Netherlands New Zealand Norway Portugal Spain Sweden Switzerland Turkey United States Required 1.00 18.50 13.00 21.30 68. Greece.60 67. develop a relative regional index for the percentage of people working part time. Using Denmark as the base.80 80.30 74.13 Country Working part time.00 10. The Economist.60 64.10 81.90 17.50 17.00 Percentage of part timers who are women 68. and Switzerland? 13 Economic and financial indicators. . percentage of total employment 27. 2.70 11.00 22.50 5.50 14. 110.404 Statistics for Business 5.30 83.70 76.00 17.40 69.60 79.00 69.80 74.00 26.60 79. develop a relative regional index for the percentage of 15. Based on the data what general conclusions can you draw? 8. 24 June 2006. Using the United States as the base. Part-time work Situation The following table gives the people working part time in 2005 by country as a percentage of total employment and also the percentage of those working part time who are women.00 22.00 12.40 68. and Germany? 7.and 16-year olds who admitted to being drunk 3 times or more in a 30-day period.00 67.80 77. Greece.50 82. Part-time work is defined as working less than 30 hours/week.90 78. 6.10 63.70 59.00 16. p. how would you describe the percentage of people working part time in Australia.10 78.50 25.

78 1.37 1.77 11. The exchange rates used in the tables are £1. . Greece.700 892 1.12 2.99 13.08 12.50 4.34 13.98 2.17 1.89 3.63 0.10 1. htm. Using the Netherlands as the base.06 17.88 1. 4.52 13.528 720 652 571 824 553 1.25 10.49 1.com/costofliving.65 14.88 14.51 0.46 3.15 1.303 754 926 1.79 2.55 N/A 1. http://www. of certain items and rental costs in major cities worldwide in 2006.60 1.14 0.58 11.43 N/A 15. Using Britain as the base.05 1. Greece.21 1.71 1.75 €1.03 12.99 2.79 2.Chapter 11: Indexing as a method for data analysis 405 3. Using the index for Britain as developed in Question 5.00 $1.46 4.06 1.03 12.37 2. 2006.01 11. develop a relative regional index for the percentage of people working part time who are women? 6.58 2.47 1.93 1.37 2.18 3.70 6.60 1.63 1.23 2.84 2.10 0.998 1.49 1.84 4.69 1.74 2.91 2.44 14.41 0.352 804 754 754 Bus or subway (£/ride) Compact International disc (£) newspaper (£/copy) Cup of coffee including service (£) Fast food hamburger meal (£) Amsterdam Athens Beijing Berlin Brussels Buenos Aires Dublin Johannesburg London Madrid New York Paris Prague Rome Sydney Tokyo Vancouver Warsaw Zagreb 1.58 14 Global/worldwide cost of living survey ranking.43 4. how would you describe the percentage of people working part time who are women for Australia. develop a relative regional index for the percentage of people working part time.80 N/A 1.13 0.32 1.88 2.44 1.51 2.42 1.46.44 1. Cost of living Situation The following table gives the purchase price.03 N/A 2.14 These numbers are a measure of the cost of living.26 1.26 3.104 2.72 10.37 1.51 1.58 4.71 2. City Rent of 2 bedroom unfurnished apartment (£/month) 926 721 1.06 1. how would you describe the percentage of people working part time in Australia.61 13.finfacts.08 13.29 1.35 4. at medium-priced establishments. and Switzerland? 9.77 1.03 0. and Switzerland? What can you say about the part-time employment situation in the Netherlands? 5.90 1. Using the index for the Netherlands developed in Question 3.20 1.00 0.71 0.97 1.74 1.75 1.96 0.

compare to London? 2.7 4. Tokyo. The following table gives the corruption index for the first 50 countries in terms of being the least corrupt. Berlin. Sydney. Paris. New York.5 5. 10. Sydney. Tokyo. compare to Madrid? 3. compare to Madrid? 6. Transparency International. Berlin. how do Amsterdam. Sydney.15 Country Australia Austria Bahrain Barbados Belgium Botswana Canada Index 8. http://www. The scores range from 10 or squeaky clean. drawing on 16 surveys from 10 independent institutions. Tokyo. New York. and Vancouver. Using rental costs as the criterion. Berlin.8 6. Using rental costs as the criterion. how do Amsterdam. New York. and Vancouver. Berlin. compare to London? 5. defines corruption as the abuse of public office for private gain. Sydney.3 8. New York. com (consulted July 2006).9 8. A score of 5 is the number Transparency International considers the borderline figure distinguishing countries that do not have a serious corruption problem. compare to Prague? 7. Paris. Tokyo. and Vancouver. highly corrupt. how do Amsterdam.9 7. Corruption Situation The Berlin-based organization. and Vancouver. Berlin. Using rental costs as the criterion.8 8. compare to Prague? 4.406 Statistics for Business Required 1. Sydney. Paris. Only 159 of the world’s 193 countries are included in the survey due to an absence of reliable data from the remaining countries.8 8. and Vancouver. Using the sum of all the purchase items except rent as the criterion. Sydney. . Using rental costs as the criterion. New York. and measures the degree to which corruption is perceived to exist among a country’s public officials and politicians. to zero. Using the sum of all the purchase items except rent as the criterion. Using the sum of all the purchase items except rent as the criterion. Tokyo.6 4. how do Amsterdam. infioplease. Tokyo. Paris. how do Amsterdam. Paris. how does the most expensive city compare to the least expensive city? Identify the cities.4 5.4 Country Kuwait Lithuania Luxembourg Malaysia Malta Namibia The Netherlands Index 4. New York.6 15 The 2005 Transparency International Corruption Perceptions Index. Paris.7 5. Berlin. how do Amsterdam. and Vancouver.1 6. It is a composite index. which gather the opinions of business people and country analysts.

From the countries in the list which country is the least corrupt and which is the most corrupt? 2.4 4.9 4.6 5.16 16 Emerging market indicators. 102.5 7. Finland. p.6 7.2 9. Compare Denmark. 6. and England using Italy as the base. Germany.000 of the population.9 Required 1.3 5. Road traffic deaths Situation Every year over a million people die in road accidents and as many as 50 million are injured.4 9. and England using Spain as the base.0 7. . Compare Denmark.3 8.3 6.1 4. This dismal toll is likely to get much worse as road traffic increases in the developing world.7 5. Finland.Chapter 11: Indexing as a method for data analysis 407 Country Chile Cyprus Czech Republic Denmark Estonia Finland France Germany Greece Hong Kong Hungary Iceland Ireland Israel Italy Japan Jordan South Korea Index 7.1 5.6 7. and England using Greece as the base. in not having a serious corruption problem? 3.7 4.4 6. What conclusions might you draw from the responses to Questions 3 to 6? 11. Compare Denmark. What is the percentage of countries that are above the borderline limit as defined by Transparency International. Germany.5 8.0 9. Germany. Finland. 4.9 6. Germany.5 6.3 5. 7.3 5.0 Country New Zealand Norway Oman Portugal Qatar Singapore Slovakia Slovenia South Africa Spain Sweden Switzerland Taiwan Tunisia United Arab Emirates United Kingdom United States Uruguay Index 9. Finland. The Economist. The following table gives the annual road deaths per 100.3 6.3 5.6 8.0 9.2 8. 5.5 5.7 7.9 6. Over 80% of the deaths are in emerging countries.3 9.2 4. Compare Denmark. and England using Portugal as the base.9 9. 17 April 2004.

org (consulted July 2006). The Dominican Republic. The Dominican Republic.000 people 17 45 13 23 18 18 12 11 20 14 14 24 21 15 24 Belgium Britain China Columbia Costa Rica Dominican Republic Ecuador El Salvador France Germany Italy Japan Kuwait Latvia Lithuania Luxembourg Mauritius New Zealand Nicaragua Panama Peru Poland Romania Russia Saint Lucia Slovenia South Korea Thailand United States Venezuela Required 1. Latvia. How would you compare Belgium. What are your overall conclusions and what do you think should be done to improve the statistics? 12. . 17 World Food Prices. France. Mauritius.000 people 16 5 16 18 19 39 18 42 4 6 13 8 21 25 22 Country Deaths per 100. Luxembourg. Latvia. and Venezuela to the United States? 4. and Venezuela to New Zealand? 6. and Venezuela to Kuwait? 5. http://www. From the countries in the list. France. How would you compare Belgium. Mauritius. Russia. The Dominican Republic.earth-policy. Luxembourg. France. The Dominican Republic. Mauritius. France. Russia. Mauritius. Russia. Family food consumption Situation The following table gives the 1st quarter 2003 and 1st quarter 2004 prices of a market basket of grocery items purchased by an American family. Luxembourg.17 In the same table is the consumption of these items for the same period. How would you compare Belgium. Russia. Latvia. Latvia. and Venezuela to Britain? 3. How would you compare Belgium. which country is the most dangerous to drive and which is the least dangerous? 2. Luxembourg.408 Statistics for Business Country Deaths per 100.

438.22 3. poultry. Develop a Paasche weighted price index using the 1st quarter 2003 for the base price.96 3.59 3.05 3.618.598.21 2.42 2.32 2. 3.788.00 3.885.83 1. Table 1 Average annual price of meat product ($US/ton).00 1.25 757.80 2.75 847. 4.36 3.83 3.67 646.53 2.151.75 2005 4.87 2.18 Table 2 gives the quantities handled by the meat wholesaler in the same period 2000 to 2005. United States 2000 2.22 1.885. Develop an average quantity weighted price index using 2003 as the base price period and the average of the consumption between 2003 and 2004.843.09 1st quarter 2003 quantity (units) 160 60 15 35 42 96 52 37 152 19 42 45 98 19 32 21 1st quarter 2004 quantity (units) 220 94 16 42 51 121 16 42 212 27 62 48 182 33 68 72 Ground chuck beef (1 lb) White bread (20 oz loaf) Cheerio cereals (10 oz box) Apples (1 lb) Whole chicken fryers (1 lb) Pork chops (1 lb) Eggs (1 dozen) Cheddar cheese (1 lb) Bacon (1 lb) Mayonnaise (32 oz jar) Russet potatoes (5 lb bag) Sirloin tip roast (1 lb) Whole milk (1 gallon) Vegetable oil (32 oz bottle) Flour (5 lb bag) Corn oil (32 oz bottle) Required 1. Table 1 gives the prices for these products in $US/ton for the period 2000 to 2005. 6. 5.08 2.25 2. Meat Situation A meat wholesaler exports and imports New Zealand lamb.46 3.17 2.765. (frozen whole carcasses) United States beef.05 1. Calculate an unweighted quantity index for this data.42 1.303.25 611.58 2001 2. http://www.172.org/es/esc/prices/CIWPQueryServlet (consulted July 2006).50 4.911.67 592.76 1.89 3.25 1.10 1.10 1.048.17 2. Calculate an unweighted price index for this data.52 2.58 2004 4.161.070.17 International Commodity Prices. Discuss the usefulness of these indexes. United States Poultry.78 1. 2.Chapter 11: Indexing as a method for data analysis 409 Product unit amount 1st quarter 2003 ($/unit) 2.24 3.00 3.14 1.58 3. United States Pork. United States broiler cuts and frozen pork. Product New Zealand Lamb Beef.92 1.27 1. Develop a Laspeyres weighted price index for this data.91 3.62 3.074. 13.08 2002 3.fao. 18 .41 1st quarter 2004 ($/unit) 2.58 2003 3.795.33 581.396.48 1.67 2.30 2.

92 79. United States Poultry. tea.575 107.43 1.887 66.69 80.19 Table 2 gives the quantities distributed by the wholesaler in the same period 2000 to 2005.759 50.219 2005 112.125 125. 2000 75.429 29.545 Commodity Sugar Tea Coffee Cocoa 19 International Commodity Prices.427 41.410 Statistics for Business Table 2 Product Amount handled each year (tons).90 1. United States Required 1.618 35.640 2003 94. coffee.10 1. Mombasa ($US/kg) Coffee (US cents/lb) Cocoa (US cents/lb) Table 2 Amount distributed each year (kg).org/es/esc/prices/CIWPQueryServlet (consulted July 2006). 3.165 109. 2.450 122.52 45.310 58. Develop an average quantity weighted price index using as the base both the average quantity distributed in the period and the average price for the period.03 2002 6.37 Commodity Sugar (US cents/lb) Tea.97 64.03 70. Develop an average quantity weighted price index using the average quantities consumed in the period and 2005 as the base period for price. United States Pork.632 79.584 2002 72.303 46. .27 2001 8.911 2004 104.457 131.67 49.125 118.197 39. 4.441 52.589 34. http://www.254 2001 67.125 129. 2000 8.715 2001 80.124 115.727 30.254 New Zealand Lamb Beef.125 110.76 73. Develop a Laspeyres weighted price index using 2000 as the base period. 5. Table 1 Average annual price of commodity.860 29.fao.875 49.000 105.904 47. The distributor buys these four commodities from its supplier at the prices indicated in Table 1 for the period 2000 to 2005.145 47.450 41.26 2005 9.135 120. Develop a Paasche weighted price index using 2005 as the base period. What are your observations about the data and the indexes obtained? 14. 2000 54.254 2004 85.450 42.840 47.125 45.857 2005 95.55 62.150 120.966 73.56 40.156 2002 85.300 27.894 2003 79.311 59. and cocoa to various coffee shops in the west coast of the United States. Beverages Situation A wholesale distributor supplies sugar.70 1.16 1.91 1.58 2003 7.54 51.47 82.57 2004 7.49 47.055 51.

646 79.600 Table 2 Metal Consumption (tons/year).011 Required 1. http://www.050 3. What are your observations about the data and the indexes obtained? 20 London Metal Exchange.650 1. $US/ton.150 2005 2.187 2002 2003 2004 126.919 22.041 93.786 112. Develop an average quantity weighted price index using the average quantities consumed in the period and 2005 as the base period for price. 4. 2. 4.688 4.788 47.415 36. Develop a Paasche weighted price index using 2005 as the base period. Develop an average quantity weighted price index using the average quantities consumed in the period and 2005 as the base period for price.uk/dataprices (consulted July 2006). 5.535 31. Non-ferrous metals Situation Table 1 gives the average price of non-ferrous metals in $US/ton in the period 2000 to 2005. Develop an average quantity weighted price index using as the base both the average quantity distributed in the period and the average price for the period.500 1. 2000 53. . What are your observations about the data and the indexes obtained? 15.550 7.772 75.502 18. 2000 1. 2.345 21. 3.100 2001 1.550 4.700 2.257 2005 102.600 900 2002 1.525 2. 5.lme.Chapter 11: Indexing as a method for data analysis 411 Required 1. 3. Develop a Laspeyres weighted price index using 2000 as the base period.470 106.425 1.302 48.563 126.600 1.000 5.570 13.712 Aluminium Copper Tin Zinc 86. Develop an average quantity weighted price index using as the base both the average quantity consumed in the period and the average price for the period.800 1.800 7. Table 1 Metal Aluminium Copper Tin Zinc Average metal price.916 49. Develop a Paasche weighted price index using 2005 as the base period.250 800 2003 1.650 1.130 32.20 Table 2 gives the consumption of these metals in tons for a manufacturing conglomerate in the period 2000 to 2005.888 5. Develop a Laspeyres weighted price index using 2000 as the base period.678 14.co.500 900 2004 1.443 63.000 18.158 2001 100.

762 3.158 70.443 19.926 33.308 40.264 328.422.423 22.714.268 21.202.168 21.260.176.571 141. June 2006 (posted 27 June 2006).711.763 18.757.035.295 37.914.085 17.500 64.436 7.454.doe.123.252 2.635 2.831.247 349.893 315.334.153 78.935 17.655 2.811 2.809 15.933 16.349 8.894 57.403.853 36.265.080 32.907.103 13.587 35.133.410.589 3.823.148.622 17.441.154 77.050 30.693.605 2.075 21.127.533 2.896 198.342 2.265 2.479.477 34.987 35.422.505.991 3.794 18.838.670 19.788 109.749 149.811.104.265 2.916.082 2.195 22.174.177 1.436 2.180 Nuclear Hydroelectric Biomass Geothermal Solar Wind 1973 1974 1975 1976 1977 1978 1979 1980 1981 1982 1983 1984 1985 1986 1987 1988 1989 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 12.922.853 45.400 20.863 30.939.015.993 17.835.070.552.119 217.332 2.755 42.622.333.985 8.858 8.811.586 15.670.351.927.662.656 3.833.817 2.647 36.350 6.109 4.713.356 31.558 3.237 69.161 6.906.930.933.878 12.589.468 35.840 22.260 3.531 4.017 22.607 2.457 3.655.835.069.150 38.919 2.333.116 2.992 7.786 13.001.388 65.451 18.017 2.412 2.665.756.159 2.440 33.268 23.589.634.007 910.899.729.339 164.688 63.eia.490 Required Using the concept of indexing.267.108 293.787 68.114 33.593.528 21.744 21.078 2.222 2.103 19.380.579.467 28 68 60 44 37 9 22.857 70.442 17.263 3.202.861.297. Case study: United States energy consumption Situation The following table gives the energy consumption by source in the United States since 1973 in million British Thermal Units (BTUs).657 1.282 219.732.350 83.783 23.671 55 111 147 109 94 55.143.914 21.122.419 23.088.241.903.967.053.168 37.007 2.303 330.775.484.265.508 2.686 2.449 22.846.905.649.683.534 32.154.539.463 20.184 2.959 328.983 2.739.909.356.148 19.458 2.526 15.734 1.144 2.780.584.801 346.716 338.075.071.256 7.010.746 129.617.689.817 20.701.391 3.272.803.615.865.089 7.959.617 105.111.399 21.922.796 311.602.900.529 324.921 31.321.411 21.935.126 2.458 3.765.057 69.221.697 8.586.627 32.471 19.690 19.148 3.445.630 33.178 229.718 62.936.082 351.454 64.290 317.067.206 6.662.529.428 17.073 23.179 2.992 34.936.221.665 40.21 Year Coal Natural gas 22.931.753.344 18.488 19.862.039.498.054 3.968 5.053 34.842.309 363.544 22.563 4.548 69.605 53.090 23.894.834.526.549 3.033 29.047.886 66.458 68.798 2.653 2.552.000.172.385.891.032.554 341.965.527.930.858 2.351 39.231.784.585 33.192 3.552.291 59.885.933 5.207 22.122.661.620 64.879. Monthly Energy Review. http://tonto.562 Petroleum products 34.613 2.062.607.418 64. .037.271 2.149 32.560 32.067.051.169 3.846.875 2.580 3.512.448 3.839.196.308 330.744.046.991.312 19.307 3.928 22.466.007 30.824.640.405 18. describe the consumption pattern of energy in the United States.449 2.919 316.776 123.674 6.628.690.688 37.727 21.623 38.976.833 70.068 1.947. 21 Energy Information Administration.345.394.639.588 20.132 6.553.563 2.075.906 2.478.596.500 2.105 3.024.827 2.573 3.995 2.968 3.314 30.635 18.211.197.083 1.809 7.989 22.581 15.328.883 20.334 114.982.121 2.610.163 335.971.717 2.320.361 33.645 38.575 2.841.490 12.401.008.005.048 2.007.043 104.830.929 20.540 37.845.205.373 1.958.067 13.796 29.793 66.327 30.274 34.391 63.426 19.341 3.381 34.702.gov.581 23.837.864.499 6.970.131.086.877 7.575 15.730.707.151.514 2.661 1.412 Statistics for Business 16.513 20.581 30.840.506.762 19.943 2.

is not correct at the given level of significance. is Asymmetrical data is numerical information that does not follow a normal distribution. Bayesian decision-making implies that if you have additional information. The x-axis is a numerical scale of the desired class width. A priori probability is being able to make an estimate of probability based on information already available. Arithmetic mean is the sum of all the data values divided by the amount of data. Addition rule for mutually exclusive events the sum of the individual probabilities. Average value metic mean. Alternative hypothesis is another value when the hypothesized value. or null hypothesis. Absolute frequency histogram is a vertical bar chart on x-axis and y-axis. Backup is an auxiliary unit that can be used if the principal unit fails. you can find that in Appendix III. It can also be called a Gantt chart after the American engineer Henry Gantt. ∑ PnQa * 100 ∑ P0Qa where P0 and Pn are prices in the base and current period. if you want to know the English equivalent of those that are Greek symbols. and the y-axis gives the length of the bar which is proportional to the quantity of data in a given class. Bar chart is a type of histogram where the x-axis and y-axis have been reversed. It is the same as the average value.Appendix I: Key terminology and formula in statistics Expressions and formulas presented in bold letters in the textbook can be found in this section in alphabetical order. In this listing when there is another term in bold letters it means it is explained elsewhere in this Appendix I. or based on the fact that something has occurred. This index is also referred to as a fixed weight aggregate price index. and Qa is the average quantity consumed during the period under consideration. Absolute in this textbook context implies presenting data according to the value collected. certain probabilities . Further. Average quantity weighted price index is. respectively. At the end of this listing is an explanation of the symbols used in this equation. is another term used for arith- Addition rule for non-mutually exclusive events is the sum of the individual probabilities less than the probability of the two events occurring together. In a parallel arrangement we have backup units.

good or bad. Boundary limits of quartiles are Q0. Bi-modal means that there are two values that occur most frequently in a dataset. Box and whisker plot is a visual display of quartiles. Categorical data is information that includes a qualitative response according to a name. x . Chi-square distribution is a continuous probability distribution used in this text to test a hypothesis associated with more than two populations. can be approximated by the normal distribution. right or wrong. The concept comes from Jacques Bernoulli (1654 –1705) a Swiss/French mathematician. or unrepresentative results. Benchmark is the value of a piece of data which we use to compare other data. or binomial. Binomial means that there are only two possible outcomes of an event such as yes or no. It is the reference point. Bivariate data involves two variables. Any data that is in graphical form is bivariate since a value on the x-axis has a corresponding value on the y-axis. are the groups into which data is Central limit theory in sampling states that as the size of the sample increases. Q3. Central moving average in seasonal forecasting is the linear average of four quarters around a given central time period. purposely or unknowingly. The probability of any outcome remains fixed over time and the trials are statistically independent. Categories organized. Bernoulli process is where in each trial there are only two possible outcomes. where the indices indicate the quartile value going from the minimum value Q0 to the maximum value Q4. etc. Q1. Bias in sampling is favouritism. The box contains the middle 50% of the data. With categorical information there may be no quantitative data. Box plot is an alternative name for the box and whisker plot. x and y. Causal forecasting is when the movement of the dependent variable. Europe. there becomes a point when the distribution of the sample – means.414 Statistics for Business may be revised to give posterior probabilities (post meaning afterwards). Characteristic probability is that which is to be expected or that which is the most common in a statistical experiment. misleading. x. Category is a distinct class into which information or entities belong. . and the United States or the categories of men and women. Central tendency is how data clusters around a central measure such as the mean value. false. present in sample data that gives lopsided. y. or category such as the categories of Asia. Bayes’ theorem gives the relationship for statistical probability under statistical dependence. The 1st whisker on the left contains the first 25% of the data and the 2nd whisker on the right contains the last 25%. Binomial distribution is a table or graph showing all the possible outcomes of an experiment for a discrete distribution resulting from a Bernoulli process. and Q4. As we move forwards in time the average changes by eliminating the oldest quarter and adding the most recent. If the sample size taken is greater than 30. label. is caused or impacted by the change in value of the independent variable. then the sample distribution of the means can be considered to follow a normal distribution even though the population is not normal. Q2.

etc. is an alternative description of the 415 nC x n! x !(n x)! Conditional probability is the chance of an event occurring given that another event has already occurred. Coefficient of correlation. Consumer surveys are telephone. Combination is the arrangement of distinct items regardless of their order. σ/μ. for sample sizes less ˆ than 30. when we have a sample size greater ˆ than 30 and by y tse.0000 cl. The value of r2 is always positive and less than or equal to the coefficient of correlation. The age groups. The value of r can take any value between 1. It is unlikely to be exactly 33. e. Confidence limits of a forecast are given by.00 and the sign is the same as the slope of the regression line. r2 is another measure of the strength of the relation between the variables x and y. Continuous data has no distinct cut-off point and continues from one class to another. b. c. or 33. written. a. The number of combinations is calculated by the expression. etc. Collectively exhaustive gives all the possible outcomes of an experiment. Continuous probability distribution is a table or graph where the variable x can take any value within a defined range. is the breadth or span of a given class. 30–39.5486 cl. Constant value is one that does not change with a change in conditions. r is a measure of the strength of the relation between the independent variable x and the dependent variable y. Confidence interval is the range of the estimate at the prescribed confidence level. Coefficient of variation of a dataset is the ratio of the standard deviation to the mean value. Coefficient of determination.Appendix 1: Key terminology and formula in statistics Chi-square test is a method to determine if there is a dependency on some criterion between the proportions of more than two populations. Continuity correction factor is applied to a random variable when we wish to use the normalbinomial approximation. or clusters. r. 50–59 years are four classes that can be groupings used in market surveys. The values of z and t are determined by the desired level of confidence. 20–29. Confidence level is the probability value for the estimate. Closed-ended frequency distribution is one where all data in the distribution is contained within the limits. such as a 95%. f. Cluster sampling is where the population is divided into groups. Class is a grouping into which data is arranged. Classical probability is the ratio of the number of favourable outcomes of an event divided by the total possible outcomes.3458. It is used as a measure of inflation. Classical probability is also known as marginal probability or simple probability. or verbal consumer responses concerning a given issue or product. . y zse.. electronic. 32. are typically used to represent a constant. 40–49. The volume of beer in a can may have a nominal value of 33 cl but the actual volume could be 32. either lower or upper case.9584. d. Consumer price index is a measure of the change of prices. The beginning letters of the alphabet. Class range Class width class range. and each cluster is then sampled at random. Confidence level may also be referred to as the level of confidence.00 and 1.

of columns 1).000). if sales for one month are $50. In graphical form this it is called an ogive. length. Degrees of freedom in a cross-classification table are (No. Dispersion dataset. It is useful for indicating how many observations lie above or below certain values. Discrete random variables are those integer values. is the same as relative Dataset is a collection of data either unsorted or sorted. Degrees of freedom in a Student-t distribution are given by (n 1). etc. is zero. volume. As examples. and 144 computers. of various types of experiments.000 and costs $40. of rows 1) * (No. the area under the curve is always unity. Data point is a single observation in a dataset.000 then it is certain that net income is $10.416 Statistics for Business Contingency table indicates data relationships when there are several categories present. Degrees of freedom means the choices that you have taken regarding certain actions. Counting rules are the mathematical relationships that describe the possible outcomes. Data is a collection of information. Cross-classification table indicates data relationships when there are several categories present. For example. It is also referred to as a cross-classification table. Distribution of the sample means is the same as the sampling distribution of the means.000 ($50. Deterministic is where outcomes or decisions made are based on data that are accepted and can be considered reliable or certain. 68. Critical value in hypothesis testing is that value outside of which the null hypothesis should be rejected. Curvilinear function is one that is not linear but curves according to the equation that describes its shape. Empirical probability frequency probability. or whole numbers. Descriptive statistics is the analysis of sample data in order to describe the characteristics of that particular sample. It is also referred to as a contingency table. Cumulative frequency distribution is a display of dataset values cumulated from the minimum to the maximum. It is the benchmark value. Empirical rule for the normal distribution states that no matter the value of the mean or the standard deviation. where n is the sample size. Covariance of random variables is an application of the distribution of random variables often used to analyse the risk associated with financial investments. 4 machines. Deviation about the mean of all observations. is the spread or the variability in a Data array is raw data that has been sorted in either ascending or descending order. Dependent variable is that value that is a function or is dependent on another variable. about the mean value x . or trials. Correlation is the measurement of the strength of the relationship between variables.000 $40. Graphically it is positioned on the y-axis. or results. – x. Discrete data is information that has a distinct cut-off point such as 10 students. Continuous random variables can take on any value within a defined range. that follow no particular pattern.26% of all data . Data characteristics are the units of measurement that describe data such as the weight. Discrete data come from the counting process and the data are whole numbers or integer values.

μx. database. or in part. It is the same as the mean value of the random variable and is given by the relationship. that produces an event. or μx E(x) np. Exponential function has the form y aebx. Factorial rule for the arrangement of n different objects is n! n(n 1)(n 2)(n 3) … (n n). Finite population is a collection of data that has a stated. Estimate in statistical analysis is that value judged to be equal to the population value. Expected value of the binomial distribution E(x) or the mean value. Fractiles divide data into specified fractions or portions. ˆ σp p2 1 p1q1 n1 p2 q2 n2 Estimated standard deviation of the distribution of the difference between the sample means is. Functions in the context of this textbook are those built-in macros in Microsoft Excel. polygon. The number of playing cards (52) in a pack is considered finite. Estimator is that statistic used to estimate the population value. 95. Microsoft Excel contains financial. Estimated standard error of the difference between two proportions is. it is principally the statistical functions that are employed. In this book. A stem-and-leaf display and a box and whisker plot are methods in EDA. The distribution can be a table. is the product of the number of trials and the characteristic probability. Estimated standard error of the proportion ˆ σp p (1 p ) n is. We can have an absolute frequency distribution or a relative frequency distribution. where – is the sample proportion and n is the p sample size. where 0! 1. However.44% falls within 2 standard deviations from the mean. Expected value of the random variable is the weighted average of the outcomes of an experiment.73% of all data falls within 3 standard deviations from the mean. Estimating is forecasting or making a judgment about a future situation using entirely. Frequency distribution groups data into defined classes. . μx ΣxP(x) E(x). Gaussian distribution is another name for the normal distribution after its German originator. N N n 1 417 falls within 1 standard deviations from the mean. or histogram. Finite population multiplier for a population of size N and a sample of size n is. and a and b are constants. Karl Friedrich Gauss (1777–1855).Appendix 1: Key terminology and formula in statistics Experiment is the activity. ˆ σx x2 1 ˆ2 σ1 n1 ˆ2 σ2 n2 Fixed weight aggregate price index is the same as the average quantity weighted price index. Event is the outcome of an activity or experiment that has been carried out. respectively. limited. Exploratory data analysis (EDA) covers those techniques that give analysts a sense about data that is being examined. where x and y are the independent and dependent variables. such as a sampling process. or a small size. Frequency polygon is a line graph connecting the midpoints of the class ranges. quantitative information. and other functions. and 99. logic.

Historical data is information that has occurred. Joint probability is the chance of two events occurring together or in succession. Groups are the units or ranges into which data is organized. Index base value is the real value of a piece of data which is used as the reference point to determine the index number. Index value number. where the y-values decrease from left to right. Hypothesis is a judgment about a situation. Interval estimate gives a range for the estimate of the population parameter. or has been collected in the past. Law of averages implies that the average value of an activity obtained in the long run will be close to the expected value. is an alternative for the index Inferential statistics is the analysis of sample data for the purpose of describing the characteristics of the population parameter from which that sample is taken. or population parameter based simply on an assumption or intuition with initially no concrete backup information or analysis. Kurtosis is the characteristic of the peak of the distribution curve. where n is the number of years. outcome. Independent variable in a time series is the value upon which another value is a function or dependent. The index number may be called as the index value. Integer values are whole numbers originating from the counting process. Laspeyres weighted price index is. Histogram is a vertical bar chart showing data according to a named category or a quantitative class range. ∑ Pn Q0 * 100 ∑ P0Q0 where Pn is the price in the current period. It has a negative slope. Inter-quartile range is the difference between the values of the 3rd and the 1st quartile in the dataset. It is calculated by the nth root of the growth rates for each year. Index number is the ratio of a certain value to a base value usually multiplied by 100. Greater than ogive is a cumulative frequency distribution that illustrates data above certain values. histograms. When the base value equals 100 then the measured values are a percentage of the base. Graphs are visual displays of data such as line graphs. It measures the range of the middle half of an ordered dataset. or pie charts. Infinite population is a collection of data that has such a large size so that by removing or destroying some of the data elements it does not significantly impact the population that remains. Horizontal bar chart is a bar chart in a horizontal form where the y-axis is the class and the x-axis is the proportion of data in a given class. Graphically the independent variable is always positioned on the x-axis. Hypothesis testing is to test sample data and make on objective decision based on the results of the test using an appropriate significance level for the hypothesis test. or the weighted outcome based on each probability of occurrence. P0 is the price in the base period and Q0 is the quantity consumed in the base period.418 Statistics for Business Geometric mean is used when data is changing over time. Least square method is a calculation technique in regression analysis that determines the best .

leptokurtic and a relatively flat peak. Mean proportion of successes. Mid-spread range quartile range. Leaves are the trailing digits in a stem-and-leaf display. where α is the proportion in the tails of the distribution. Midpoint of a class range is the maximum plus the minimum value divided by 2. ˆ It is the equation of the best straight line for the data that minimizes the error between the data points on the regression line and the corresponding actual data from which the regression line is developed. Mid-hinge in quartiles is the average of the 3rd and 1st quartile. It divides the data into two equal halves. substituting for the mean value and the standard deviation of the binomial distribution in the normal distribution transformation relationship we have. Mutually exclusive events not occur together. or platykurtic. Multiple regression is when the dependent variable y is a function of many independent variables. and the curve of the distribution tails off to the left side of the x-axis. “Is there evidence that a value is less than?” Leptokurtic is when the peak of a distribution is sharp. Line graph shows bivariate data on x-axis and y-axis. If time is included in the data this is always indicated on the x-axis. Mesokurtic describes the curve of a distribution when it is intermediate between a sharp peak. Margin of error is the range of the estimate from the true population value. It can be represented by an equation of the general form. Level of confidence in estimating is (1 α). Marginal probability is the ratio of the number of favourable outcomes of an event divided by the total possible outcomes. Marginal probability is also known as classical probability or simple probability.Appendix 1: Key terminology and formula in statistics straight line for a series of data that minimizes the error between the actual and forecast data. or that area outside of the confidence interval. Left-skewed data is when the mean of a dataset is less than the median value. Median is the middle value of an ordered set of data. Mean value of random data is the weighted average of all the possible outcomes of the random variable. Mean value is another way of referring to the arithmetic mean. y a b1x1 b2x2 b3x3 ˆ … bkxk. μ– p p. Midrange is the average of the smallest and the largest observations in a dataset. In this case. z x σ μ x np npq x np(1 np p) . As a graph it has a positive slope such that the y-values increase from left to right. are those that can- Normal-binomial approximation is applied when np 5 and n(1 p) 5. is another term for the inter- 419 Mode is that value that occurs most frequently in a dataset. quantified by a small standard deviation. Linear regression line takes the form y a bx. The 2nd quartile and the 50th percentile are also the median value. Less than ogive is a cumulative frequency distribution that indicates the amount of data below certain limits. Left-tail hypothesis test is used when we are asking the question.

420

Statistics for Business Normal distribution, or the Gaussian distribution, is a continuous distribution of a random variable. It is symmetrical, has a single hump, and the mean, median and mode are equal. The tails of the distribution may not immediately cut the x-axis. Normal distribution density function, which describes the shape of the normal distribution is, f (x) 1 2πσx e

(1/ 2)[(x μx )/ σx ]2

ogive shows data more than certain values. An ogive can illustrate absolute data or relative data. One-arm-bandit is the slang term for the slot machines that you find in gambling casinos. The game of chance is where you put in a coin or chip, pull a lever and hope that you win a lucky combination! One-tail hypothesis test is used when we are interested to know if something is less than or greater than a stipulated value. If we ask the question, “Is there evidence that the value is greater than?” then this would be a right-tail hypothesis test. Alternatively, if we ask the question, “Is there evidence that the value is less than?” then this would be a left-tail hypothesis test. Ordered dataset is one where the values have been arranged in either increasing or decreasing order. Outcomes of a single type of event are kn, where k is the number of possible events, and n is the number of trials. Outcomes of different types of events are k1 * k2 * k3 * * kn, where k1, k2, , kn are the number of possible events. Outliers are those numerical values that are either much higher or much lower than other values in a dataset and can distort the value of the central tendency, such as the average, and the value of the dispersion such as the range or standard deviation. P in upper case or capitals is often the abbreviation used for probability. Paired samples are those that are dependent or related, often in a before and after situation. Examples are the weight loss of individuals after a diet programme or productivity improvement after a training programme. Pareto diagram is a combined histogram and line graph. The frequency of occurrence of the data is indicated according to categories on the

Normal distribution transformation relationship is, z x μx σx

where z is the number of standard deviations, x is the value of the random variable, μx is the mean value of the dataset and σx is the standard deviation of the dataset. Non-linear regression is when the dependent variable is represented by an equation where the power of some or all the independent variables is at least two. These powers of x are usually integer values. Non-mutually exclusive can occur together. events are those that

Null hypothesis is that value that is considered correct in the experiment. Numerical codes are used to transpose qualitative or label data into numbers. This facilitates statistical analysis. For example, if the time period is January, February, March, etc. we can code these as 1, 2, 3, etc. Odds are the chance of winning and are the ratio of the probability of losing to the chances of winning. Ogive is a frequency distribution that shows data cumulatively. A less than ogive indicates data less than certain values and a greater than

Appendix 1: Key terminology and formula in statistics histogram and the line graph shows the cumulated data up to 100%. This diagram is a useful auditing tool. Parallel bar chart is similar to a parallel histogram but the x-axis and y-axis have been reversed. Parallel arrangement in design systems is such that the components are connected giving a choice to use one path or another. Which ever path is chosen the system continues to function. Parallel histogram is a vertical bar chart showing the data according to a category and within a given category there are sub-categories such as different periods. A parallel histogram is also referred to a side-by-side histogram. Parameter describes the characteristic of a population such as the weight, height, or length. It is usually considered a fixed value. Percentiles are fractiles that divide ordered data into 100 equal parts. Permutation is a combination of data arranged in a particular order. The number of ways, or permutations, of arranging x objects selected in order from a total of n objects is,

nP x

421

minimum probable level that we will tolerate in order to accept the null hypothesis of the mean or the proportion. Point estimate is a single value used to estimate the population parameter. Poisson distribution describes events that occur during a given time interval and whose average value in that time period is known. The probability relationship is, P(x) λx e λ x!

Polynomial function has the general form kxn, where x is y a bx cx2 dx3 the independent variable and a, b, c, d, …, k are constants. Population is all of the elements under study and about which we are trying to draw conclusions. Population standard deviation of the population variance. Population variance σ2 is the square root

is given by,

∑ (x

N

μx )2

n! (n x)!

Pictogram is a diagram, picture, or icon that shows data in a relative form. Pictograph pictogram. is an alternative name for the

where N is the amount of data, x is the particular data value, and μx is the mean value of the dataset. Portfolio risk measures the exposure associated with financial investments. Posterior probability is one that has been revised after additional information has been received. Power of a hypothesis test is a measure of how well the test is performing. Primary data source. is that collected directly from the

Pie chart is a circle graph showing the percentage of the data according to certain categories. The circle, or pie, contains 100% of the data. Platykurtic is when the curve of a distribution has a flat peak. Numerically this is shown by a larger value of the coefficient of variation, σ/μ. p-value in hypothesis testing is the observed level of significance from the sample data or the

Probability is a quantitative measure, expressed as a decimal or percentage value, indicating the likelihood of an event occurring. The value

422

Statistics for Business for a

[1 P(x)] is the likelihood of the event not occurring. Probabilistic is where there is a degree of uncertainty, or probability of occurrence from the supplied data. Quad-modal is when there are four values in a dataset that occur most frequently. Qualitative data is information that has no numerical response and cannot immediately be analysed. Quantitative data is information that has a numerical response. Quartiles are those three values which divide ordered data into four equal parts. Quartile deviation is one half of the interquartile range, or (Q3 Q1)/2. Questionnaires are evaluation sheets used to ascertain people’s opinions of a subject or a product. Quota sampling in market research is where each interviewer in the sampling experiment has a given quota or number of units to analyse. Random implies that any occurrence or value is possible. Random sample is where each item of data in the sample has an equal chance of being selected. Random variable is one that will have different values as a result of the outcome of a random experiment. Range is the numerical difference between the highest and lowest value in a dataset. Ratio measurement scale is where the difference between measurements is based on starting from a base point to give a ratio. The consumer price index is usually presented on a ratio measurement scale. Raw data is collected information that has not been organized.

Real value index (RVI) of a commodity period is, RVI Current value of commodity Base value of commodity * Base indicator * 100 Current indicator

Regression analysis is a mathematical technique to develop an equation describing the relationship of variables. It can be used for forecasting and estimating. Relative in this textbook context is presenting data compared to the total amount collected. It can be expressed either as a percentage or fraction. Relative frequency histogram has vertical bars that show the percentage of data that appears in defined class ranges. Relative frequency distribution shows the percentage of data that appears in defined class ranges. Relative frequency probability is based on information or experiments that have previously occurred. It is also known as empirical probability. Relative price index IP (Pn /P0 ) * 100,

where P0 is the price at the base period, and Pn is the price at another period. Relative quantity index IQ (Qn /Q0 ) * 100

where Q0 is the quantity at the base period and Qn is the quantity at another period. Relative regional index (RRI) compares the value of a parameter at one region to a selected base region. It is given by, Value at other region * 100 0 Value at base region V0 * 100 Vb

Appendix 1: Key terminology and formula in statistics Reliability is the confidence we have in a product, process, service, work team, individual, etc. to operate under prescribed conditions without failure. Reliability of a series system, RS is the product of the reliability of all the components in the system, or RS R1 * R2 * R3 * R4 * * Rn. The value of Rs is less than the reliability of a single component. Reliability of a parallel system, RS is one minus the product of all the parallel components not working, or RS 1 (1 R1)(1 R2)(1 R3) (1 R4) (1 Rn). The value of RS is greater than the reliability of an individual component. Replacement is when we take an element from a population, note its value, and then return this element back into the population. Representative sample is one that contains the relevant characteristics of the population and which occur in the same proportion as in the population. Research hypothesis is the same as the alternative hypothesis and is a value that has been obtained from a sampling experiment. Right-skewed data is when the mean of a dataset is greater than the median value, and the curve of the distribution tails off to the right side of the x-axis. Right-tail hypothesis test is used when we are asking the question, Is there evidence that a value is greater than? Risk is the loss, often financial, that may be incurred when an activity or experiment is undertaken. Rolling index number is the index value compared to a moving base value often used to show the change of data each period. Sample is the collection of a portion of the population data elements. Sampling is the analytical procedure with the objective to estimate population parameters. Sampling distribution of the means is a distribution of all the means of samples withdrawn from a population. Sampling distribution of the proportion is a probability distribution of all possible values of the sample proportion, –. p Sampling error is the inaccuracy in a sampling experiment. Sample space gives all the possible outcomes of an experiment. Sample standard deviation, s is the square root of the sample variation, s2. Sample variance, s2 is given by,

423

s2

∑ (x

(n

x )2 1)

where n is the amount of data, x is the particu– lar data value, and x is the mean value of the dataset. Sampling from an infinite population means that even if the sample were not replaced, then the probability outcome for a subsequent sample would not significantly change. Sampling with replacement is taking a sample from a population, and after analysis, the sample is returned to the population. Sampling without replacement is taking a sample from a population, and after analysis not returning the sample to the population. Scatter diagram is the presentation of time series data in the form of dots on x-axis and y-axis to illustrate the relationship between the x and y variables. Score is a quantitative value for a subjective response often used in evaluating questionnaires.

424

Statistics for Business Seasonal forecasting is when in a time series the value of the dependent variable is a function of time but also varies often in a sinusoidal fashion according to the season. Secondary data is the published information collected by a third party. Series arrangement is when in system, components are connected sequentially so that you have to pass through all the components in order that the system functions. Shape of the sampling distribution of the means is about normal if random samples of at least size 30 are taken from a non-normal population; if samples of at least 15 are withdrawn from a symmetrical distribution; or samples of any size are taken from a normal population. Side-by-side bar chart is where the data is shown as horizontal bars and within a given category there are sub-categories such as different periods. Side-by-side histogram is a vertical bar chart showing the data according to a category and within a given category there are sub-categories such as different periods. A side-by-side histogram is also referred to as a parallel histogram. Significantly different means that in comparing data there is an important difference between two values. Significantly greater means that a value is considerably greater than a hypothesized value. Significantly less means that a value is considerably smaller than a hypothesized value. Significance level in hypothesis testing is how large, or important, is the difference before we say that a null hypothesis is invalid. It is denoted by α, the area outside the distribution. Simple probability is an alternative for marginal or classical probability. Simple random sampling is where each item in the population has an equal chance of being selected. Skewed means that data is not symmetrical.

Stacked histogram shows data according to categories and within each category there are sub-categories. It is developed from a crossclassification or contingency table. Standard deviation of a random variable the square root of the variance or, σ is

∑ (x

μx )2 P(x)

Standard deviation of the binomial distribution is the square root of the variance, or σ σ2 (npq). Standard error of the difference between two proportions is, σp

p2

1

p1q1 n1

p2 q 2 n2

**Standard deviation of the distribution of the difference between sample means is, σx
**

2 σ1 n1 2 σ2 n2

1

x2

Standard deviation of the Poisson distribution is the square root of the mean number of occurrences or, σ (λ). Standard deviation of the sampling distribution, – σx , is related to the population standard deviation, σx, and sample size, n, from the central limit theory, by the relationship, σx σx n Standard error of the estimate regression line is, se of the linear

∑ (y

n

ˆ y)2 2

**Appendix 1: Key terminology and formula in statistics Standard error of the difference between two means is, σx
**

2 σ1 n1 2 σ2 n2 – σp is,

425

Stratified sampling is when the population is divided into homogeneous groups or strata and random sampling is made on the strata of interest. Student-t distribution is used for small sample sizes when the population standard deviation is unknown. Subjective probability is based on the belief, emotion or “gut” feeling of the person making the judgment. Symmetrical in a box and whisker plot is when the distances from Q0 to the median Q2, and the distance from Q2 to Q4, are the same; the distance from Q0, to Q1 equals the distance from Q3 to Q4 and the distance from Q1 to Q2 equals the distance from the Q2 to Q3; and the mean and the median value are equal. Symmetrical distribution is when one half of the distribution is a mirror image of the other half. System is the total of all components, pieces, or processes in an arrangement. Purchasing, transformation, and distribution are the processes of the supply chain system. Systematic sampling is taking samples from a homogeneous population at a regular space, time or interval. Time series is historical data, which illustrate the progression of variables over time. Time series deflation is a way to determine the real value in the change of a commodity using the consumer price index. Transformation relationship is the same as the normal distribution transformation relationship. Tri-modal is when there are three values in a dataset that occur most frequently. Type I error occurs if the null hypothesis is rejected when in fact the null hypothesis is true.

1

x2

Standard error of the proportion, σp pq n p(1 p) n

Standard error of the sample means, or more simply the standard error is the error in a sampling experiment. It is the relationship, σx σx n

Standard error of the estimate in forecasting is a measure of the variability of the actual data around the regression line. Standard normal distribution is one which has a mean value of zero and a standard deviation of unity. Statistic describes the characteristic of a sample, taken from a population, such as the weight, volume length, etc. Statistical dependence is the condition when the outcome of one event impacts the outcome of another event. Statistical independence is the condition when the outcome of one event has no bearing on the outcome of another event, such as in the tossing of a fair coin. Stems are the principal data values in a stemand-leaf display. Stem-and-leaf display is a frequency distribution where the data has a stem of principal values, and a leaf of minor values. In this display, all data values are evident.

426

Statistics for Business Type II error is accepting a null hypothesis when the null hypothesis is not true. Two-tail hypothesis test is used when we are asking the question, “Is there evidence of a difference?” Unbiased estimate is one that on an average will equal to the parameter that is being estimated. Univariate data is composed of individual values that represent just one random variable, x. Unreliability is when a system or component is unable to perform as specified. Unweighted aggregate index is one that in the calculation each item in the index is given equal importance. Variable value is one that changes according to certain conditions. The ending letters of the alphabet, u, v, w, x, y, and z, either upper or lower case, are typically used to denote variables. Variance of a distribution of a discrete random variable is given by the expression, σ2 istic probability, p, of success, and the characteristic probability, q, of failure, or σ2 npq. Venn diagram is a representation of probability outcomes where the sample space gives all possible outcomes and a portion of the sample space represents an event. Vertical histogram is a graphical presentation of vertical bars where the x-axis gives a defined class and the y-axis gives data according to the frequency of occurrence in a class. Weighted average is the mean value taking into account the importance or weighting of each value in the overall total. The total weightings must add up to 1 or 100%. Weighted mean average. is an alternative for the weighted

Weighted price index is when different weights or importance is given to the items used to calculate the index. What if is the question asking, “What will be the outcome with different information?” Wholes numbers are those with no decimal or fractional components.

∑ (x

μx )2 P(x)

Variance of the binomial distribution is the product of the number of trials n, the character-

Appendix 1: Key terminology and formula in statistics

427

**Symbols used in the equations
**

Symbol λ μ n N p q Q r r2 s σ ˆ σ se t x – x y – y ˆ y z Meaning Mean number of occurrences used in a Poisson distribution Mean value of population Sample size in units Population size in units Probability of success, fraction or percentage Probability of failure (1 p), fraction or percentage Quartile value Coefficient of correlation Coefficient of determination Standard deviation of sample Standard deviation of population Estimate of the standard deviation of the population Standard error of the regression line Number of standard deviations in a Student distribution Value of the random variable. The independent variable in the regression line Average value of x Value of the dependent variable Average value of y Value of the predicted value of the dependent variable Number of standard deviations in a normal distribution

Note: Subscripts or indices 0, 1, 2, 3, etc. indicate several data values in the same series.

This page intentionally left blank

**Appendix II: Guide for using Microsoft Excel in this textbook
**

(Based on version 2003)

The most often used tools in this statistics textbook are the development of graphs and the built-in functions of the Microsoft Excel program. To use either of these you simply click on the graph icon, or the object function of the toolbar in the Excel spreadsheet as shown in Figure E-1. The following sections give more Figure E.1 Standard tool bar, Excel version 2003. information on their use. Note in these sections the words shown in italics correspond exactly to the headings used in the Excel screens but these may not always be the same terms as used in this textbook. For example, Excel refers to chart type, whereas in the text I call them graphs.

**Generating Excel Graphs
**

When you click on the graph icon as shown in Figure E-1 you will obtain the screen that is illustrated in Figure E-2. Here in the tab Standard Types you have a selection of the Chart type or graphs that you can produce. The key ones that are used in this text are the first five in the list – Column (histogram), Bar, Line, Pie, and XY (Scatter). When you click on any of these options you will have a selection of the various formats that are available. For example, Figure E-2 illustrates the Chart sub-type for the Column options and Figure E-3 illustrates the Chart subtypes for the XY (Scatter) option.

Assume, for example, you wish to draw a line graph for the data given in Table E-1 that is contained in an Excel spreadsheet. You first select (highlight) this data and then choose the graph option XY (Scatter). You then click on Next and this will illustrate the graph you have formed as shown in Figure E-4. This Step 2 of 4 of the chart wizard as shown at the top of the window. If you click on the tab, Series at the top of the screen, you can make modifications to the input data. If you then click on Next again you will have Step 3 of 4, which gives the various Chart options for presenting your graph. This window is shown in Figure E-5. Finally, when you again click on Next you will have Chart Location according to the screen shown in

.3 XY graphs selected. Figure E.2 Graph types available in Excel.430 Statistics for Business Figure E.

Y line graph. . y data.5 Options to present a graph.Appendix II: Guide for using microsoft excel in this textbook 431 Table E-1 x y 1 5 x. 2 9 3 14 4 12 5 21 Figure E.4 X. Figure E.

In Chapter 10.432 Statistics for Business Figure E-6. For organizing my data I always prefer to create a new sheet for my graphs. Figure E. but the choice is yours! Regardless of what type of graph you decide to make. . the procedure is the same as indicated in the previous paragraph.6 Location of your graph. Figure E-7 shows the screen for developing this linear regression line. that is as a new file for your graph or As object in.7 Adding regression line. One word of caution is the choice in the Standard Types between using Line and XY (Scatter). For any line graph I always use XY (Scatter) rather than Line as with this presentation the x and y data are always correlated. which is the graph in your spread sheet. Figure E. This gives you a choice of making a graph As new sheet. we discussed in detail linear regression or the development of a straight line that is the best fit for the data given.

a number without its sign.8 Selecting functions in Excel. If you are in doubt and you want further information about using a particular function you have. “Help on this function” at the bottom of the screen. For example. Table E-2 English ABS AVERAGE EXPONDIST Excel functions used in this book. This gives a listing of all the functions that are available in Excel in alphabetical order. and select. you will have the screen as shown in Figure E-8. Returns the absolute value of a number. That is the negative numbers are ignored Mean value of a dataset Cumulative distribution exponential function. you have the equivalent functions in French!) 433 Using the Excel Functions If you click on the fx object in the tool bar as shown in Figure E-1.Appendix II: Guide for using microsoft excel in this textbook on the bottom of the screen. Or select a category. French ABS MOYENNE LOI. given the value of the ransom variable x and the mean value λ. (Note. When you highlight a function it tells you its purpose. in the command.SUP . Each function indicated can be found in appropriate chapters of this textbook. here the function ABS is highlighted and it says Figure E. Use a value of cumulative 1 Rounds up a number to the nearest integer value CEILING ARRONDI. for those living south of the Isle of Wight. Table E-2 gives those functions that are used in this textbook and their use. All.EXPONENTIELLE For determining Gives the absolute value of a number.

and cumulative 1 IF FACT FLOOR FORECAST SI FACT ARRONDI. This function is in the tools menu Gives the kurtosis value. or the peakness of flatness of a dataset Gives the parameters of a regression line Determines the highest value of a dataset Middle value of a dataset Determines the lowest value of a dataset Determines the mode. x.GEOMETRIQUE GOAL SEEK KURT LINEST MAX MEDIAN MIN MODE NORMDIST VALEUR CIBLE KURTOSIS DROITEREG MAX MEDIANE MIN MODE LOI. data assuming a linear relationship between the two Determines how often values occur in a dataset Gives the geometric mean growth rate from the annual growth rates data.BINOMIALE For determining Gives the area in the chi-distribution when you enter the chi-square value and the degrees of freedom Gives the chi-square value when you enter the area in the chi-square distribution and the degrees of freedom Gives the area in the chi-square distribution when you enter the observed and expected frequency values Gives the number of combinations of arranging x objects from a total sample of n objects Returns the confidence interval for a population mean Determines the coefficient of correlation for a bivariate dataset The number of values in a dataset Returns the inverse of the one-tailed probability of the chi-squared distribution Binomial distribution given the random variable.KHIDEUX KHIDEUX.INVERSE LOI. and characteristic probability.CORRELATION NBVAL KHIDEUX.INVERSE . the cumulative values are determined Evaluates a condition and returns either true or false based on the stated condition Returns the factorial value n! of a number Rounds down a number to the nearest integer value Gives a future value of a dependent variable. If cumulative 1. mean value. μ.NORMALE.KHIDEUX COMBIN INTERVALLE. The percentage of geometric mean is the geometric mean growth rate less than 1 Gives a value according to a given criteria. p. y. or that value which occurs most frequently in a dataset Area under the normal distribution given the value of the random variable. mean value. the individual value is determined. (Continued) French LOI. If cumulative 0. standard deviation. standard deviation.434 Statistics for Business Table E-2 English CHIDIST CHIINV CHITEST COMBIN CONFIDENCE CORREL COUNT CHIINV BINOMDIST Excel functions used in this book.INVERSE TEST. p. from known variables x and y. If you use cumulative 0 this gives a point value for exactly x occurring Value of the random variable x given probability. and cumulative 1. μ.CONFIANCE COEFFICIENT.NORMALE NORMINV LOI. σ. σ.INF PREVISION FREQUENCY GEOMEAN FREQUENCE MOYENNE. x.

01. and the mean value. the area to the right is determined. 0. given the number of standard deviations z Determines the number of standard deviations. and the degree of freedom.NORMALE. z. INVERSE DECALER PEARSON CENTILE PERMUTATION LOI.STANDARD.STUDENT. x. the area in both tails is determined Determines the value of the Student-t given the probability or area outside the curve. French LOI. given the value of the probability. or the coefficient of correlation.INVERSE GOAL SEEK VAR VAR. the individual value is determined. etc Gives the number of permutations of organising x objects from a total sample of n objects Poisson distribution given the random variable.02.STANDARD LOI. If cumulative 0. If number of tails 2. υ Gives a target value based on specified criteria Determines the variance of a dataset on the basis it is a sample Determines the variance of a dataset on the basis it is a population POWER RAND RANDBETWEEN ROUND RSQ PUISSANCE ALEA ALEA. Select the data and enter the percentile.BORNES ARRONDI COEFFICIENT. r Gives the percentile value of a dataset. x.ENTRE. r2 or gives the square of the Pearson product moment correlation coefficient Logical statement to test a specified condition Determines the slope of a regression line Gives the square root of a given value Determines the standard deviation of a dataset on the basis for a sample Determines the standard deviation of a dataset on the basis for a population Determines the total of a defined dataset Returns the sum of two columns of data Probability of a random variable. p. 0.P . λ. p Repeats a cell reference to another line or column according to the offset required Determines the Pearson product moment correlation. the cumulative values are determined Returns the result of a number to a given power Generates a random number between 0 and 1 Generates a random number between the numbers you specify Rounds to the nearest whole number Determines the coefficient of determination. given the degrees of freedom. p.STUDENT TINV VALEUR CIBLE VAR VARP LOI. If cumulative 1. υ and the number of tails.Appendix II: Guide for using microsoft excel in this textbook 435 Table E-2 English NORMSDIST NORMSINV OFFSET PEARSON PERCENTILE PERMUT POISSON Excel functions used in this book.POISSON For determining The probability.DETERMINATION IF SLOPE SQRT STDEV STDEVP SUM SUMPRODUCT TDIST SI PENTE RACINE ECARTYPE ECARTYPEP SOMME SOMMEPROD LOI.NORMALE. If the number of tails 1.

standard error for intercept a bk. coefficient of determination F-ratio SSreg. A virgin block of cells at least two columns by five rows are selected. Table E-3 b seb r2 F SSreg Microsoft Excel and the linear regression function. standard error for slope bk 1 se. sum of squares of residual (unexplained variation) b2. standard error for slope bk r2. b coefficient of determination F-ratio for analysis of variance sum of squares due to regression (explained variation) a sea se df SSresid intercept on y-axis standard error for intercept a standard error of estimate degree of freedom (n 2) sum of squares of residual (unexplained variation) Table E-4 Microsoft Excel and the multiple regression function. Multiple Regression As for simple linear regression. intercept on y-axis sea. Here now a virgin block of cells is selected such that the number of columns is at least equal to the number of variables plus one and the number of rows is equal to five. degree of freedom SSresid. the various statistical data are returned in a format according to Table E-4. slope due to variable xk 1 sek 1. slope due to variable x1 se1. the various statistical data are returned in a format according to Table E-3.436 Statistics for Business Simple Linear Regression Simple linear regression functions can be solved using the regression function in Excel. standard error for slope b1 a. When the y and x data are entered into the function. slope due to variable xk sek. slope due to variable x2 se2. sum of squares due to regression (explained variation) . When the y and x data are entered into the function. standard error for slope b2 b1. Slope due to variable x Standard error for slope. standard error of estimate df. multiple regression functions can be solved with the Excel regression function. bk 1.

c. w. x. Σ • Mean value • Addition of two variables • Difference of two variables • Constant multiplied by a variable • Constant summed n times • Summation of a random variable around the mean • Binary numbering system • Greek alphabet Statistics involves numbers and the material in this textbook is based on many mathematical relationships. and conversions. This is bivariate data. articles. z Upper case U. d. the letter z is used to denote the third axis. W. and y is the ordinate or vertical axis. x is the abscissa or horizontal axis. rules.Appendix III: Mathematical relationships Subject matter Your memory of basic mathematical relationships may be rusty. I prefer to use the lower case. In three-dimensional graphs. However. Where twodimensional graphs occur. and weather conditions. By convention. The following summarizes the basics. Lower case a. The objective of this appendix is to give a detailed revision of arithmetic relationships. the driving time from these two points is a variable as it depends on road. Constants and variables A constant is a value which does not change under any circumstances. b. C. however. constants are represented algebraically by the beginning letters of the alphabet either in lower or upper case. v. traffic. …… . B. y. V. …… Upper case A. By convention variables are represented algebraically by the ending letters of the alphabet again either in lower or upper case. and other documents you will see constants and variables written in either upper case or lower case. Lower case u. A variable is a number whose value can change according to various conditions. and conversion factors. e. The following concepts are covered: Constants and variables • Equations • Integer and non-integer numbers • Arithmetic operating symbols and equation relationships • Sequence of arithmetic operations • Equivalence of algebraic expressions • Fractions • Decimals • The Imperial and United States measuring system • Temperature • Conversion between fractions and decimals • Percentages • Rules for arithmetic calculations for non-linear relationships • Sigma. Y. fundamental ideas. In textbooks. X. The straight line distance from the centre of Trafalgar Square in London to the centre of the Eiffel Tower in Paris is constant. There seems to be no recognized rule. Z The variables denoted by the letters x and y are the most commonly encountered. D. E.

If there are no brackets in the expression and only addition and subtraction operating symbols then you work from left to write. etc. 0. or ⁄ ⁄ ⁄ ⁄ decimals such as 2. Greater than Less than Greater or equal to / . a b.56. Integer and non-integer numbers An integer is a whole number such as 1. When we multiply two algebraic terms a and b together this can be shown as: ab. Sequence of arithmetic operations When we have expressions related by operating symbols the rule for calculation is to start first to calculate the terms in the Brackets. In statistics an integer is also known as a discrete number or discrete variable if the number can take any different values. and before we had computers. The following is a linear equation meaning that the power of the variables has the value of unity: y a bx Less than or equal to Approximately equal to For multiplication we have several possibilities to illustrate the operation. etc. 5.438 Statistics for Business Equations An equation is a relationship where the values on the left of the equal sign are equal to the values on the right of the equal sign. 19. y a bx3 cx2 d With numbers. 25. 2. 734. Table M-1 Sequence of arithmetic operations. 34. the symbol * is used for the multiplication sign rather than the historical symbol. a.75.79. 34⁄ means the ratio of 3 to 4 but also 3 divided by 4. then Division and/or Multiplication.b. or a * b This equation represents a straight line where the constant cutting the y-axis is equal to a and the slope of the curve is equal to b. For example. and finally Addition and/or Subtraction (BDMAS) as shown in Table M-1. An equation might be non-linear meaning that the power of any one of the variables has a value other than unity as for example. Values in any part of an equation can be variables or constants. or 312. Non-integer numbers are those that are not whole numbers such as the fractions 12. the multiplication or product of two values was written using the symbol for multiplication: 6 4 24 With Excel the symbol * is used as the multiplication sign and so the above relationship is written as: 6*4 24 It is for this reason that in this textbook. Symbol B D M A S Term Brackets Division Multiplication Addition Subtraction Evaluation sequence 1st 2nd 2nd Last Last Arithmetic operating symbols and equation relationships The following are arithmetic operating symbols and equation relationships: Addition Subtraction Plus or minus Equals Not equal to Divide This means ratio but also divide. Table M-2 gives some illustrations. and 0.

and M-8. multiplication. Fractions Fractions are units of measure expressed as one whole number divided by another whole number. In this case these improper fractions can be reduced to a whole number and proper fractions to give 42 ⁄ 7. which means that the number is greater than unity as for 19 30 52 example 7 9 . Tables M-5. give Table M-2 Expression Calculation procedures for addition and subtraction. then subtraction Multiplication and divisions first then addition 25 11 7 9*6 4 22 * 4 12 * 6 6 9*5 3 7(9) 9(5 7) (7 4)(12 3) 6 20 * 3 10 11 Table M-3 Algebraic and numerical expressions.000 9/100 7. subtracting multiplying. Decimals A decimal number is a fraction. The common fraction has the numerator on the top and the denominator on the bottom: Common fraction Numerator Denominator is greater than the denominator. and 3 . 34 and 5 ⁄ 12. and dividing fractions are given in Table M-4. Example b) c 6 7 7 9 (7 3) 15 21 6*7 7*6 3 * (8 4) 6 13 9 7 3 (9 7) 21 15 6 42 3 * 8 3 * 4 36 3 19 Arithmetic rule a b b a a (b c) a b c (a a b b a a*b b*a a * (b c) a * b a * c . The improper fraction is when the numerator The metric system. 57⁄ 9 and 61⁄ 3. whose denominator is any power of 10 so that it can be written using a decimal point as for example: 7/10 0.70 7. used in continental Europe.051 0. M-6. is based on the decimal system and changes in units of 10.09 The common fraction is when the numerator is less than the denominator which means that the ⁄ ⁄ number is less than one as for example. The rules for adding.Appendix III: Mathematical relationships 439 Equivalence of algebraic expressions Algebraic or numerical expressions can be written in various forms as Table M-3 illustrates. Answer 21 50 88 72 48 63 108 21 17 Operation Calculate from left to right Multiplication before subtraction A minus times a plus is a minus Minus times a minus equals a minus Multiplication then addition and subtraction A bracket is equivalent to a multiplication operation Addition in the bracket then the multiplication Expression in brackets. 17. M-7.051/1.

000001 Hectare (ha) 100 1 0.000 100 10 1 0.000 10.000 100 Square centimetre (cm2) 1010 108 1.00000001 10 10 .001 0.000.01 Metre (m) 1.000.000 100. a ab b c b c 1 5 4 5 4 7 1 6 16 5 2 7 4 7 4*7 5*6 4*6 5*7 6 5 5*6 4 5 2 2 7 28 30 24 35 14 15 16 11 30 20 5 4 a a a *b c *d a b * c d a c b d 4 7 * 5 6 4 5 7 6 a *d c *b Table M-5 Micrometre (μm) 109 108 10.0001 0.000 100.440 Statistics for Business Table M-4 1 a a c a c 1 b b c b c b Rules for treating fractions.1 Decimetre (dm) 10.00001 Kilometre (km) 1 0.1 0.000 100 1 Square decimetre (dam2) 108 1. Millimetre (mm) 1. Square millimetre (mm2) 1012 1010 108 1.1 0.000 100 1 0.000001 Table M-6 Square micrometre (μm2) 1018 1016 1014 1012 1010 108 Surface or area measure.000 10.001 0.000 1.000 100 1 0.000 10.0001 0.000 10.01 0.001 0.000001 0.01 0.000 100 1 0.000001 0.000 Length or linear measure.01 0.01 0.100 0.01 0.0001 0.1 0.1 0.000 100 10 1 Centimetre (cm) 100.01 0.000 10.000 1.000 100 10 1 0.000 10.000.000 1.000 1.000 100 10 1 0.00001 0.000.00000001 Square kilometre (km2) 1 0.000 10.000.0001 Hectometre (hm) 10 1 0.0001 0.0001 0.001 Decametre (dam) 100 10 1 0.000 1.010 0.01 0.000.0001 Are (a) 10.000.01 Square metre (m2) 1.

0001 0.00001 10 6 10 7 10 8 10 9 Cubic centimetre (cm3) 106 100.01 0.01 0.01 0.1 0.0001 0.1 0.001 0.001 0.000 100 10 1 0.000 100 10 1 0.000 1.01 0.0001 0.001 0.0001 Decilitre (dl) 10.1 0. Millilitre (ml) 106 100.000 1.000 10.000 100 10 1 0.1 0.0001 0.00001 10 6 10 7 10 8 Kilolitre (kl) 1 0.00001 10 6 10 7 Hectolitre (hl) 10 1 0.01 0.1 0.001 0.00001 Litre (l) 1.000 1.000 1.00001 10 6 10 7 10 8 10 9 Appendix III: Mathematical relationships 441 .Table M-7 Microlitre (μl) 109 108 107 106 100.1 0.000 100 10 1 0.001 0.001 0.000 1.000 10.01 0.001 0.1 0.01 0.0001 0.1 0.01 0.0001 0.00001 10 6 Cubic metre (m3) 1 0.01 0.000 100 10 1 Volume or capacity measure.000 10.000 100 10 1 0.000 100 10 1 0.001 0.1 0.0001 0.001 Centilitre (cl) 100.001 Cubic decimetre (dm3) 1.00001 10 6 Decalitre (dal) 100 10 1 0.000 10.01 0.1 0.

3701 39.000 10.001 0.1 0.000 100. the Fahrenheit system is used where the freezing point of water is given at 32°F and the boiling point as 212°F.000 Kilometres (km) 0.01 0.609.01 0.3333 1 1. and metric measuring system The Imperial measuring system is used in the United States and partly in England though there are efforts to change to the metric system.00001 0.144 16.3937 39. volume.934.000 100 10 1 0.00001 0. US. M-10.000001 10 7 10.1 0.4800 91. The Imperial.0254 0.08 Conversions for length or linear measurement.093.360 0.0006 0.1 0.000 10.000 Metres (m) 0.001 0.000 100 10 1 0.00001 0.000 1.0001 0.001 0.0001 107 10.4400 160.000 1.000 100 10 1 109 1.01 0.0001 0.000001 10 7 10 8 10 9 10 10 10 11 10 12 Microgram Milligram Centigram Decigram (μg) (mg) (cg) (dg) 1012 109 108 107 1. surface.0000254 0.1 0.000 1 0. and weight.3048 0.00001 0.1 10 10.2137 106 0. M-11. In the United States.001 0.01 1 1.344 0.00001 Table M-9 Inches (in) 1 12 36 63.000 10.000 10 1 0.1 0.0001 0.000 100 10 1 0.1 0.000 100.0001 0.6133 Miles (mi) 1. Gram (g) 1.0001 0.000001 10 7 10 8 1. Tables M-9. M-12.6093 0.000001 10 7 10 8 10 9 1 0.0006 1 6.001 0. Feet (ft) 0.0009144 1.000.0328 3.280.1 0.5400 30.001 108 100. .01 0.0936 1. give approximate conversion tables for key measurements. and M-13 Temperature In Europe.000.001 0.84 Yards (yd) 0.000.442 Statistics for Business Table M-8 Mass or weight measure.000 100 10 1 0.280 0.0001 0.00001 0.254 3.0109 1.00001 0.01 0.000 1. Here the freezing point of water is measured at 0°C and the boiling point is 100°C.001 0.2808 3.0002 0.000 1.6214 Millimetres (mm) 0. and sometimes in the United Kingdom. usually the Celsius system is used for recording temperature.048 9.093.000 100 10 1 0.000 1.000001 Decagram Hectogram Kilogram Metric ton (dag) (hg) (kg) (t) 100.000 Centimetres (cm) 2. The Imperial numbering system is quirky with no apparent logic as compared to the metric system.01 0.9144 1.44 0.01 0.5783 105 0.370.0003048 0.001 1 the relationships for length.40 1 100 100.0833 1 3 5.760 0. The United States system is not always the same as the Imperial measuring system.0278 0.

3 1011 0.840 Conversions for mass or weight measure.4047 0.6005 0.016.0002 4.03 0.4 10 7 0.064 Kilograms (kg) 0.47 0.6652 0.3 10 9 0.3003 Conversions for capacity or volume measure.2730 1.0000 2.2011 Imperial gallon 0.26 107 0.0005 0.0000 2.00002835 0.26 105 259.07031 1 The following is the relationship between the two scales: C When F 5 9 (F 32) F 9 5C 32 32) 100ºC 212ºF then C 5 9(212 ⁄ 59 ⁄ * 180 .016.3304 1.8326 0.4516 0.1200 Long ton 0.59 0.3550 Litres (l) 3.1365 Table M-12 Ounce (oz) 1 16 32. Square yard (yd2) Square mile (mi2) Are (a) Square centimetre (cm2) Square metre (m2) Area Hectare (ha) Square kilometre (km2) Square feet (ft2) 1 0.0625 1 2.2 1.0006 0.560 4.8 10 6 0.4 108 4.00003125 0.0929 0.6652 0.2231 Conversion for pressure.3 10 6 0.2500 Imperial quart 1.4163 0.9072 1.4 10 0.7 10 5 0.840 0.4536 907.8 10 4 0.4200 138.2082 2.8326 0.0000 1.8361 0.0016 1 6.28 108 3.4163 4. USA quart 2.064 Metric ton 0.0000 1.350 453.5000 0.2 10 6 144 1 0.2011 0.4731 4.0161 Table M-13 psi 1 14.4021 1.01 109 0.000 2.02835 0.5460 2.5000 28.Appendix III: Mathematical relationships 443 Table M-10 Square inch (in2) Conversions for surface or area measure.07031 1 bar 0.1111 0.63 107 43.296 9 1 0.7 10 7 0.2 10 4 1.2011 0.0084 0. Pound (lb) 0.8042 2.8929 1 Grams (g) 28.5000 2.60 907.5000 0.004 Table M-11 USA gallon 1.200 1.7 10 9 929.0000 115.9 10 7 3 8.0000 0.1041 1.8750 277.0005 1 1.2500 1.0000 0.0008 0.05 103 40.7850 1.1 106 1 640 0.5000 Imperial pint 3.7100 69.4021 1.000027902 0.000 35.0004 0.0000 1.0000 Cubic inches (in3) 231.0000 4.0069 0.0009 0.0000 0.6005 USA pint 4.0000 0.9 10 5 0.8925 0. kg/cm2 0.240 Short ton 0.00 2.0000 1.

That is. 11. you find the lowest common denominator. x3.1875 Fraction Decimal 1/2 0. usually written. n: x ∑x n M(ii) .6712 * 100 67. Σ Very often in statistics we need to determine the sum of a set of data and to indicate this we use the Greek letter sigma. etc. x2.33 3/4 0. Σ. you divide the numbers after the decimal point by 100 and simplify until both the numerator and denominator are the smallest possible integer values. 21.875 When C 100ºC then F 9⁄5 C 32 9⁄5 100 * 212ºF 32 A percentage to a fraction divide by 100 25% 25/100 1 ⁄ 4 A percentage to a decimal move the decimal place two places to the left 25.444 Statistics for Business Table M-14 Conversion of fraction and decimals. and 9 then the sum of these values is: Percentages The term percent means per 100 so that 50 percent. Table M-14 gives some examples. Conversion between fractions and decimals To convert from a fraction to a decimal representation you divide the numerator or upper value in the fraction.6667 2(3⁄16) 2. To convert from a three-decimal presentation to a fraction you divide to numbers after the decimal point by 1. is 50 per 100. Rules for arithmetic calculations for non-linear relationships Table M-15 gives algebraic operations when the power of the variable.125 7/8 0. 3.875 6(2⁄3) 6. or the constant. 50%.000 etc. 2.2586 When C When F 35ºC then F 104ºF then C 9⁄5 C 9⁄5 32 35 32 * * 72 95ºF 5 9 (104 ⁄ 59 ⁄ 32) 40ºC. are the individual values in the dataset. Σx divided by the number of observations.75 1/8 0. If we have a set of n values of data of a variable x then the sum of these is written as: Σx x1 x2 x3 x4 … xn M(i) where x1. In order to change from A fraction to a percentage multiply by 100 3 ⁄ 4 75% (3⁄ 4 * 100 75%) A decimal to a percentage multiply by 100 0. To convert from a two-decimal presentation to a fraction.50 1/3 0.12% Σx 7 3 2 11 21 9 53 Mean value The mean or average of a dataset is equal to the sum of the individual values. Sigma. is non-linear. If we have a dataset consisting of the values 7. 4(7⁄8) 4. by the denominator or the lower value in the fraction.86% 0.6712 0.

(9. then n x (1 5 9 2 6) 5 9 5 1. ( 2. 6). ( 5. ( 2. 2). 5). Example 53 * 52 (5 ) 1/2 3 2 0 32 2 2 2 Arithmetic rule xaxb (x ) 1/x a a ab a b a x(a x x x x ab a b) 5(3 5 2 5 6 6 2 2) 55 3. 5. (1. ∑ (x − y) ∑ x ∑ y If we have the following five values.625 0. (9. y) 8). 4) M(iv) ∑ (x y) ∑x ∑ y M(iii) ∑ (x (1 6) ( 5 2) (9 ( 2 5) (6 4) 5 3 17 7 2 4 8) Assume the five values of the dataset (x.Appendix III: Mathematical relationships 445 Table M-15 Algebraic operations involving powers. (6. ( 5. Difference of two variables The sum of the difference of two variables is equal to the sum of the individual differences of each variable: Addition of two variables The total of the addition of two variables is equal to the total of the individual sum of each variable: (1. 5). 2. 6. 6). (6. 4): ∑ (x y) ∑x ∑ y ∑x ∑ y (1 5 9 2 6) (6 2 8 5 4) 9 5 4 ∑ (x y) (1 6) ( 5 2) (9 ( 2 5) (6 4) 10 14 8) ∑ (x y) 7 7 1 3 ∑ x ∑ y (1 5 9 (6 2 Constant multiplied by a variable The sum of a constant times a variable equals to the constant times the sum of the variables: 2 6) 8 5 4) ∑x ∑ y 9 5 14 ∑ (kx) k∑ x M(v) . 8). 9.125 15. 2).80 5 and. y) are.6667 If x has values of 1.25 51 6 0 x /x x /x (a b) (a a) 5 /5 x 1 6 /6 (3 2) (2 2) 5 1 4*5 2 3 20 x *y x y x y x * y 16 * 25 64 144 16 * 25 64 144 8 12 0.

nx Greek alphabet ∑x n ∑x nx M(xi) Thus substituting in equation M(x) we have. 5 5 5 5 5 5 Summation of a random variable around the mean Summation rules can be used to demonstrate that the summation of a random variable around a mean is equal to zero. Related to this is the binary numbering system or binary code which is a system of arithmetic based on two digits. ∑k Binary numbering system 30 5 6 In the textbook. n*k M(vi) That is the sum of the random variables about the mean is equal to zero. etc. The on/off system of most electrical appliances is based on the binary system where 0 off and 1 on. 6. either the 0 or the 1 in the binary code is called the bit. the 2nd position to the immediate left of the 1st position. 45 Table M-16 system. both in upper and lower case and their English equivalent. reading from right to left. The arithmetic of computers is based on the binary code. 2 5 6 2. . 9. A digit. Concept of binary numbering ∑ (kx) k∑ x 5 1 5 5 5 5(1 5 9 9 5 45 2 6) ) Constant summed n times A constant summed n times is equal to the n times the constant: 0001 0010 0100 1000 Value of 1 in 1st position is equal to Value of 1 in 2nd position is equal to Value of 1 in 3rd position is equal to Value of 1 in 4th position is equal to 1 2 4 8 ∑k If n 6. zero and one. Table M-18 gives the Greek letters. ∑x ∑x ∑x nx 0 M(x) From equation M(ii).446 Statistics for Business If k 5. the actual value of 1 depends on the position of the 1 in a binary number. and k 5. the resulting number is twice the original number. and decimal numbers from 1 to 10. x. To write these Greek letters in Microsoft Word you write the English letter and then change the font to symbol. 5. and a zero is placed after it. and some of the areas where they are often used. A digit doubles its value each time it moves one place further to the left as shown in Table M-16. ∑x ∑x 0 M(viii) – For any fixed set of data. or binary digit. we introduced the binomial distribution. In the binary code. ∑ (x x) ∑x ∑x ∑x ∑x ∑x 0 M(xii) In statistics. we are reading from the right so that the 1st position is at the extreme right. then. ∑x ∑ (x x) nx M(ix) Thus from equation M(viii) we have. x Or. Or. ∑ (x ∑ (x x) x) 0 M(vii) From equation M(iv) equation M(vii) becomes. Table M-17 gives the equivalent between binary numbers. Note. is a constant and thus from equation M(vi). letters of the Greek alphabet are sometimes used as abbreviations to denote various terms. When a binary digit is moved one space to the left. and x has values of 1.

standard deviation (lower case) Chi-square test in hypothesis testing Electricity . Lower case α β γ δ ε ζ η θ ι κ λ μ ν ξ o π ρ σ τ υ φ. ϕ χ ψ ω Name Alpha Beta Gamma Delta Epsilon Zeta Eta Theta Iota Kappa Lambda Mu Nu Xi Omicron Pi Rho Sigma Tau Upsilon Phi Chi Psi Omega English equivalent a b g d e z h th (q) i k l m n x o p r. exponential smoothing Hypothesis testing Calculus as the derivative Angle of a triangle Poisson distribution. j) ch (c) ps (y) o (w) Use Hypothesis testing. queuing theory Mean value Circle constant Total (upper case). Explanation Value of 1 in the 1st position 1 Value of 1 in the 2nd position 2 Value of 1 in the 1st position value of 1 in the 2nd position 3 Value of 1 in the 3rd position 4 Value of 1 in the 3rd position value of 1 in the 1st position 5 Value of 1 in 3rd position value of 1 in 2nd position 6 Value of 1 in the 3rd position value of 1 in 2nd position value of 1 in 1st position Value of 1 in the 4th position 8 Value of 1 in 4th position value of 1 in 1st position 9 Value of 1 in 4th position value of 1 in 2nd position 10 Decimal Binary 0 1 2 3 4 5 6 7 8 9 10 0000 0001 0010 0011 0100 0101 0110 0111 1000 1001 1010 7 Table M-18 Upper case A B Γ Δ E Z H Θ I K Λ M N Ξ O Π P Σ T Y Φ X Ψ Ω Greek alphabet and English equivalent. rh s t u ph (f.Appendix III: Mathematical relationships 447 Table M-17 Relationship between decimal and binary numbers.

This page intentionally left blank .

83% and the probability of less than 600 is 5.00% and the probability of at least 900 is 14. Alternatively. Thus. Solutions – swimming pool Question 3 5 7 Answer Polygon is symmetrical 20% (a) From ogive. Probability of fewer than 600 people using the pool is 5. at the 10% level there will be an estimated to be 640 people – this number is greater than the 600 criterion (c) Probability of at least 600 is 95.17%. Solutions – buyout Question 6 Answer About 7 (7.83%. or and.45) 2. probability of more than 900 using the pool is 14.17% – this is greater than 10%.00% – this is less than 10%. Also. probability of less than 900 is 85.83% .548 million per office About 21 $10. Solutions – closure Question 2 4 5 Answer $4. or a difference of 85. both the conditions are satisfied (b) In Part (a) we have indicated that although the criteria was one or the other. we have shown that both conditions apply.Appendix IV: Answers to end of chapter exercises Chapter 1: Presenting and Organizing Data 1. or a difference of 80.3 million per office remaining 3. Further at the 10% benchmark there will be an estimated 925 people – this is greater than 900.00%.

except Canada where there is no change. France (41. Solutions – purchasing expenditures Question 1 3 4 7 8 Answer Minimum value is €63.450 Statistics for Business 4.680 so use €60. Solutions – European sales Question 4 5 Answer Best three performing countries are England. the US dollar has become stronger.84% of Western Europe) .000. Japan (56) Western Europe (30.04%) Worst performing countries are Ireland. (It takes more of the local currency to buy dollars in 2005 than it did in 2004) 7.923 so use €334. In this way there is an even number of limits 15.000. and Czech Republic with less than 10% of the profits (7. and Spain with over two-thirds of the profit (67.47%) 8. Maximum value is €332. France (59).50% 31 85% €151. Solutions – exchange rates Question 2 Answer For all countries. Denmark. Solutions – Rhine river Question 2 3 4 Answer Logical minimum value is 16 m and logical maximum is 38 m 22% 31% 5. Norway.000 6. Solutions – nuclear power Question 4 5 Answer United States (104).52% of world).

For some reason.84% California 10.54%) 12. Over a quarter are in administration. Kerry 46.69%) France consumes more than twice the amount of pills that is consumed in the United Kingdom 13. Latvia.Appendix IV: Answers to end of chapter exercises 451 9. Bulgaria (2.45). Solutions – electoral college Question 3 4 5 Answer Bush 53.67).33).00) Explains why the textile firms (and other industries) are losing out to China 11. Solutions – textile wages Question 2 4 Answer Times more than China.22). Africa (1. United States (32.16%.03%) No sales recorded for the United States. Solutions – immigration to Britain Question 5 Answer Largest proportion of these new immigrants is from Poland (56. Turkey (6.80). China (mainland) (1.20).22% . Slovakia (6. Serbia. Egypt (1.74%). Solutions – pill popping Question 3 4 Answer France 21% (20. and Lithuania (each with 0.66%). business. the textbook cannot be sold in this country 10.44%). Solutions – textbook sales Question 1 2 4 6 Answer Range of data from maximum to minimum is large so visual presentation of smaller values is not clear Europe (56.02). and management (27.22% 39.27%) England (51. Italy (38. France (40. They are young with over 80% no more than 34-years old.

even though less frequent.32%).88%).46%). route directions poor (11. fruit not clean (9.31%). Solutions – chemical delivery Question 2 3 Answer Drums damaged (32.49%) . boxes badly loaded (22. pallets poorly stacked (10.31%).23%).19%) Drums damaged (32. is a more serious problem Fruit squashed (27. delay – bad weather (15. documentation wrong (21.19%). Not necessarily as bacteria on the fruit (3.07%).02%).452 Statistics for Business 14.73%) 15. Solutions – fruit distribution Question 2 3 Answer Fruit squashed (27. client documentation incorrect (8.

Appendix IV: Answers to end of chapter exercises 453 Chapter 2: Characterizing and Defining Data 1.40% 21.360 3.23 Option 1 15.46 $1.310.02% 4.760 units .585.763 $46.798.55% 6.78/hour $5.04% $1. Solutions – production Question 1 2 Answer 6. Solutions – billing rate Question 1 3 4 Answer $50.885.28 $3.75/hour 2. Solutions – investment Question 1 2 3 4 5 6 Answer 6. Solutions – delivery Question 1 2 Answer $8.

775.287. Solutions – net worth Question 1 Answer 8.45/unit 8. Solutions – Euro prices Question 1 Maximum Minimum Range Average Midrange Median Standard deviation sample s/μ 2 Answer Milk (1l) 1.63 2.406.50% 44. Solutions – construction Question 1 2 Answer 4.52 0.78 Can of coke 1.46 0.61 2.50 0.83 0.00% 16. Solutions – students Question 1 2 Answer 3.34 0.700.00 12.18 0.93 0.00 16.20 18.00 9.82% Finland has some of the highest prices. A Big Mac has the lowest deviation relative to the mean value.05% .49 0. Spain’s prices are all below the mean and the median value.85 0.92% 11.22 0.24 28.71 14.18 Big Mac 3.075.50 2.60 0.24% 14.82 0.49% $143.450.23 Renault Mégane 21.85 18.250.58 17.99 2.30 Stamp for postcard 0.33 0.07 Compact disc 22.79% 4.10 2.56% 13.38 0.76 0.00 16.409 7.454 Statistics for Business 5.73 19.51 0.97 2.81 0.11 0.54 0.57 0. which is an indicator of the lower cost of living 6.98 7.

France (11). (i) 36.22.38. China (10. 4.01 (silver). Italy (10. China (32).Appendix IV: Answers to end of chapter exercises 455 9.63%). Russia (27).97%) 11. France (10. Russia (28. Solutions – printing Question 1 2 Answer $14.09% Data is pretty evenly distributed as the mean.50).9060. Australia (16. Germany (48). Japan (37). (b) 1. Italy (32). 2 modal values.67. Australia (49).77 12. Australia (17).33). Solutions – trains Question 1 2 3 4 5 6 7 8 9 Answer 24.15% 1.00). Russia (8. 1.24 25. Italy (10).40. median. Russia (92). 4.50) 4.48. Cuba (9) United States (35.14 50.01 (gold). China (24.83). Britain (9). This value occurs 5 times 42 24 147. (c) 2.88. France (33). Solutions – summer Olympics Question 1 2 3 4 5 Answer United States (103). Britain (30). 2.00).00). Germany (14).63%).05 (Continued) . South Korea (30) United States (35). (f) 3.40. (d) 2. China (63). South Korea (10.05. (g) 1. and mode are quite close 10.00 25. (h) 0. Germany (15.51. (e) 3.67). 5.38.36 (bronze) United States (11. Britain (9.23 $15.44 12.80.50). Japan (13. Japan (16).33). 2. Solutions – Big Mac Question 1 3 Answer (a) 5.

923. 136.71.80.460.218.75. 211.155.00. 81.25.08.05.456 Statistics for Business 12.613.00.680. (e) 184. 276. Solutions – (Continued) Question 4 5 Answer 0.275. 187.956.00.20. 153. (f) 180. 137.532. 139. 295.46.00. 168.25.01.00 Mean is greater than the median and so the distribution is right-skewed 0% 63. 140.290.088.52. 192.04. 157. 141. 146.153.00.65.02. 151.92.273.60.08.59.433.12.25.68.208.80.85.00.00. 207. 194.50.00. 165. 219. The price for Denmark is within the 4th quartile and indicates a high cost of living country 13.64 25% 143. 108.50.50.20. 225.680. 128.860.097.44.767.30. 133. 263.35.05.227. 129. 181.607.600.64. 248.627.749.209. 156. 187.532.227. 154.00.755.173.823. 150. 1.850.13.333.088.11.387.00.039. 224.201.00. 194. 188.936.976. Solutions – swimming pool – Part II Question 1 Answer (a) 120.943.25. 171.377. 99.005.89.733.84. (e) 797. 148.44.60. 132.86.498.75.00.696.890.50.350.757.888. 257. 196. (i) 55.898.531.18.76.434.00. (i) 109. 110.696.48. 191. 178.333.855. (j) 109.680.713. (g) 194.109.157.080.78. 102.92 The price of the Big Mac in Indonesia is within the 1st quartile indicating a country with a low cost of living.52. 217. 143. 288.715.309. 242. 241. 147. 166.32. Solutions – purchasing expenditures – Part II Question 1 2 4 5 Answer (a) 332.128. 249. 261. (d) 198.00.04.88 100% 332.767. 175.516.243.40.56.132.10. 221. 787.44. 228.32. 235. 161.00. 203. (b) 1.693. (d) 581. 854.56.00.19.105.413. 179.325. 956.117.446. 123. (j) 30.157.60.46.181.558.871. 126. 198. 3 values. 184.82.730.00.360.923.766.376. 711. (l) 507.927.376.934.575.859.64.40.70. 293.472.176.00.32.665.30.00.00.926.972.63 It is reasonably symmetrical 3 (Continued) .154. 113.15.249.75.75.002. The price for Hungary is within the 3rd quartile and the inter-quartile range and has a relative high cost of living.60.302.923.00 14.557.94. (b) 63. 106.66.53. 217.785. 270.815. (h) 869.95.101. 214. 3 times. 161. (k) 0.25.686. (h) 3.153. (m) 142.182.99. (c) 269.50. 246. 238.63.105. 173. 243. (n) 782.60. 118.64.34.28. 180.787.124. 182. (f) 782. 94.44.039. 253. 159. 156.00.06. 176.912. 125. 199.1400.14.78. 204. 223. (g) 787.71 50% 180. 332.59.299.29.92. 86.782.12.16% 63.00.940. 115.66 75% 223.00.532. 307.979.445.36. 185. 163. Singapore is within the 2nd quartile and the inter-quartile range – indicating a relative low cost of living.60. (c) 507. 189. 231. 298. 120.370.00.167.332. 172. 274.571.

9.88 914.59.58.35.64 797.74 848. 6.90 573.23.78.96 763. 8. 9.55 678.12.47.10 741. 10.92. 12.44. 9.44. 10.53 75% 11.52 620. 12.70 927. 14.76. 6.86 779. 9. 12.00 757.67. 11.86.10 0% 5. (g) 11. 9.73 830.04 902. 10.00 542. 6.06 905.35 827. 10. 7. Solutions – buyout – Part II Question 1 2 3 Answer (a) 15.64.05.51.08.76 724.71.48 937. 100% 15.60 839. 10.85.30.58.70.1. 12.10.24 854.00 879. 10.50.55 581.35.30. 11.32 878.30. 9. 9.63.22 777. 10. (b) 5. 9.74. 11. 11.48. 5.71 629.10 650.81.45 795. 7.00 15. 12.38 748.63.14 612.36 846. 7.3.26. 7.88 791. 10.97 619.35.48. 10.53.68.16.17 876. 11. 15. 11.18. 10. 10.39.81. (i) 2.26 726.32 956. 8.92 884.26.60.50.36.03 950.50 790.50.94 844.94.97. 10. 11.98.88 985.21.20 775.54.98 700.04 808.30.13.00 711.10.99 561.30 50% 10. 13.00 869.82 1.74 709.10.50. 14.00 835.16 930.40 816.52 794.35.30. 13.01 870. (f) 10.72. 8.00 790. 7.64 728. 12. (c) 9.76.54.52 607.83 689.20 872.12 781.69.17 669.28 919. (d) 10. 8. 11. 9. (h) 2.57.21. 8. 9.10.64.27 746. 12. 10. 12. 10. 13.62 829.09 652.00 825.45. 13. 10.50. 6. 11.088.20. 8.86.62.56.88 755.50.35. 12. 9. 11.69.016.50.30.Appendix IV: Answers to end of chapter exercises 457 14. occurs 3 times.84 750.80 897. 12.70.59.11. 10.06.13 25% 9.51 731.96. 10. 10.15 1.06 5.80 771.68.40.70.03 751.66 806.23. 11.52.42 810. 7.00 835.78 822. 8. 10.40 760.02.17 765.90 821.58 762.25 Percentile (%) 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 Value Percentile (%) 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 Value Percentile (%) 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 Value 715.44 707.8.50.10.00 748.96 663.79 743.97 825.50. 11. 8. 9.46 680. 11. 14. 10.76.60.30.84 683. 8.04 869.30. 11.44 866.81.07 792.86. 8. Solutions – (Continued) Question 4 Answer Percentile (%) 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 Value 507. 11. 11.00 860. 10. 10. 11.28.27.08 744.89. 9.91.73. 10. 9. 7. 7.59. 11.70. (e) 10.00 691.60 970.00 671.34.65 609.10 . 9.00 678.59 999. 9.29. 12. 12.55 787.

28% 5. Solutions – getting to work Question 1 2 3 4 5 Answer 23.52% 40. Solutions – gardner’s gloves Question 1 2 3 4 5 Answer 35.90% 28.95% 19.458 Statistics for Business Chapter 3: Basic Probability and Counting Rules 1.90% 51.34% 2.50% 46.60% 4. Solutions – packing machines Question 1 2 Answer 38.00% 85.58% 3.16% 25.00% 4.50% 1.76% 3.87% 47.93% (Continued) .81% 11.69% 37.11% 80. Solutions – market survey Question 1 2 3 4 5 6 7 Answer 20.

Solutions – (Continued) Question 3 4 Answer 8.00.11% 44. 77. Solutions – roulette Question 1 2 3 4 5 6 7 Answer 11.08% 16. 88. 55.13% 8.98% 49.00 44. Solutions – sub-assemblies Question 1 2 3 Answer 82.00 6.44%.44%.30% 5.67%. £150.11%.56% 7.92% (Continued) . Solutions – sourcing agents Question 2 3 4 5 6 Answer 38.44%.56% 44. £25.33% 53. 33.33%. 55.10% 3. £125. Solutions – study groups Question 1 Answer 40.Appendix IV: Answers to end of chapter exercises 459 4. 11.89% 22.62% 33.78%. 55.63% 6.86% 17. 66.56% 0%.93% 12.22%.

(b) 116.460 Statistics for Business 8.2051%. Solutions – flag flying Question 1 2 Answer 1. (c) 3.432. 99.00% 1. 4. Solutions – bicycle gears Question 1 Answer Values for each of the three columns are as follows: (A) 1. Solutions – film festival Question 1 2 3 4 Answer 6. 0. 1.520 33.200.80% 11. 3. 4. Solutions – (Continued) Question 4 5 Answer 1.06% 0. 9. 10 (F) 3.040 13. 32 (J) 4. Solutions – assembly Question 1 2 Answer 78.184.580. 8. 28 (I) 4.967.436.9997%.891.0003%. (b) 586. 2. 5. 8 (E) 2. 12 (G) 3. 1 (B) 1. 7.1469% 21.8531% 11. 99.280.08889 8.280.561.051.7949%. (d) 5.3069%. 21 (H) 4. Solutions – workshop Question 1 2 3 4 Answer 30.040.600.02% 9. 99.724. 0. 0.285 (Continued) 1028 (There are 27 countries in the European Union as of 2007) .800 (a) 1. 6 (D) 2. 2 (C) 2. 36 12. (c) 17.20% 10.6931%.297. (d) 1 (a) 5.20% 1. 7.

bain hydromassant 2.152. application de boue marine 8. Solutions – (Continued) Question 3 4 5 Answer 80.04141 1 1064 14. enveloppement d’algues . bain hydromassant 2. massage sous affusion 4. douche oscillante 3. massage à sec 2. douche oscillante 3. application de boue marine 7. douche oscillante 3. massage sous affusion 8. application de boue marine 7.307.Appendix IV: Answers to end of chapter exercises 461 13. douche oscillante 4. the enveloppement d’algues and application de boue marine are not put together. enveloppement d’algues 6. massage sous affusion 4.001.368.730 3.000 70.000 15. hydrojet 8. enveloppement d’algues 7. Solutions – model agency Question 1 2 3 Answer 54.905.320 24 479. bain hydromassant 5. Note. douche à jet 5. massage à sec 1. douche oscillante 6. massage à sec 1.641.674. douche à jet 5. bain hydromassant 2. bain hydromassant 2. they have similarities regarding their benefits 1.600 56 The following is a possible schedule for Question 4.264 1. here that the treatment.959. hydrojet 8. Even though they are different treatments. douche à jet 6. massage à sec 1. massage à sec 1. Solutions – thalassothérapie Question 1 2 3 4 Answer 40. hydrojet 8.

60 accidents/day 3.018.19%) 3.33 There are high occurrences of non-deliveries in the winter months. Solutions – rental cars Question 1 2 3 4 5 Answer 27. The coefficient of variation is 56.12%. Remains the same at €3.161 £1.250 No change No change Laramie coefficient of variation at 7.750 Table 1 1. Also high occurrences in the summer period perhaps because of high volume of air traffic . which might be explained by bad weather.64% which is high £505.750 No change.523. Solutions – HIV virus Question 1 2 3 4 Answer $3. Solutions – road accidents Question 2 3 4 5 6 7 Answer 6. Solutions – express delivery Question 2 3 4 5 6 Answer 16.67% for three packages not being delivered on-time 4. Table 2 0.462 Statistics for Business Chapter 4: Probability Analysis for Discrete Data 1.00 No.69% Data from Table 2 as it shows less dispersion 2.843.7% is smaller than Cheyenne (12.74 Yes.788.69/month 3.631 4. The annual payment is €4.843.50 cars $151.470 £1.

Solutions – gift store Question 1 2 3 4 5 Answer 99. Solutions – investing Question 1 2 3 4 5 6 Answer $137.00 9.75 $194.36% .53% 70. $56.018. 22 He would “lose” £312.00 $97.50.03 6.64% 76.Appendix IV: Answers to end of chapter exercises 463 5.22% 51.00%.31% 0.55% 8.89% 41.73.08% 17.25% 58.94 7. $26. Solutions – bookcases Question 1 2 3 4 Answer 20.50 (opportunity cost) Expected values are long-term and may not necessarily apply to the immediate following month 6. Solutions – European business school Question 4 5 6 7 8 Answer 3.25 25.47% 83. $63.70%.50 $104.

37% (Continued) .39% 17. Solutions – computer Question 2 3 4 5 6 7 Answer 38.73% 13.39% 9 11.32% 12.61% 65.52% 11.68% 12.63% 10.55% 94. Solutions – bank credit Question 2 3 4 5 6 7 Answer 11. Solutions – clocks Question 2 3 4 5 6 Answer 1.19% 24.10% 93.13% 34.37% 95.02% 82.74% 73.83% 82.36% 33.464 Statistics for Business 9.87% 26. Solutions – biscuits Question 2 3 4 Answer 29.

or they may go elsewhere! Provide an incentive to use the cash-for-gas utilization.63% 84.Appendix IV: Answers to end of chapter exercises 465 12. a gift. Solutions – cash-for-gas Question 2 3 Answer No.82% 10. The values are close. This would reduce operating costs by about 50%. If this means a longer wait for customers they may prefer to use the pumps that take credit card purchase. and the method does not just siphon off customers who before used credit card for purchases 15.80% 35. Solutions – bottled water Question 2 3 4 5 6 Answer 4.98% 16.22% 1.52% Only have one attendant at the exit kiosk.04% 13. Since P 5% and sample size is greater than 20 it is reasonable to use the Poisson–binomial approximation .53% 14. However. Solutions – cashiers Question 2 3 4 6 7 8 9 Answer 10.42% They both tail-off rapidly to the right.72% 81.54% 4. Solutions – (Continued) Question 5 6 Answer 83.28% 64. The owner should not be satisfied as the percentage of 12 or more using the pump is only 81.67% 84.91% 4. such as lower prices. this would only be a suitable approach if overall more customers use the service station.

40% 37.07% 8.159 km 2.466 Statistics for Business Chapter 5: Probability Analysis in the Normal Distribution 1. Solutions – training programme Question 1 2 3 4 5 Answer 23.526 Yes.64% About 134 6.46% 12.87% 42. Solutions – cashew nuts Question 1 2 3 4 Answer 42. 126.08% 8. Solutions – renault trucks Question 1 2 3 4 5 6 Answer 47.95 seconds 3.53 g. Solutions – telephone calls Question 1 2 3 4 5 Answer 6.97 g 124.68% 86.08% Minimum 123.39% 6.393 km 258.68% 8.62% 146.74% About 38 and 74 days 4. maximum 129.69 gm .

49 mm 368.912 litre 35.Appendix IV: Answers to end of chapter exercises 467 5.36% (Continued) .50% 9.32% 7.80% 76. Solutions – marmalade Question 1 2 3 Answer 43.23 mm Slender.8765) Just about all of the 19 (18.88% 90.00% 19.50% About 22.95% 4.66% 96. 0. with a sharp peak or leptokurtic as the coefficient of variation is small. Solutions – gasoline station Question 1 2 3 4 5 Answer 4. Solutions – ping-pong balls Question 1 2 3 4 5 6 7 Answer 40.624 369.5460) About 13 (12.6055) 6.1846) Between 1 and 2 (1. Solutions – publishing Question 1 2 3 4 5 Answer Almost none (0.77 and 371.457 litre 7.5549) About 5 (4.20% 8.

78% 9. Solutions – motors Question 1 Answer 56.33% 48.98% 12.63 g 331. Solutions – (Continued) Question 4 5 6 7 Answer About 26.27 336.96 minutes 46.23% 11.37 g 3.09% 22.02% 77.31 minutes 10.91% 925 104.34%) . standard deviation is 19. Solutions – doors Question 1 2 Answer 196 cm About 3% (3.63 and 348.468 Statistics for Business 8. Solutions – yogurt Question 1 2 Answer 13. Solutions – restaurant service Question 1 2 3 4 5 6 Answer Average service time is 125.755.00 minutes.

Solutions – buyout – Part III Question 1 2 3 Answer About 7 (7.41% 58.13% Very little chance (0. Solutions – savings Question 1 2 3 4 Answer 29.93% 89.Appendix IV: Answers to end of chapter exercises 469 13.45) The values are very close When data follows closely a normal distribution it is not necessary to construct the ogives for this type of information .89% 87.04%) 15. Solutions – machine repair Question 1 2 3 4 Answer 2.28% 33.03% 1.75% 14.

Solutions – telephone calls Question 1 2 3 4 5 6 7 Answer 7.19% Since we have a sample of size 10 rather than just a size 1 from the population.15% 48. the percentage decreases as the limits are away from the mean Normal as the population is normal 6 3. In Questions 1 and 3.44% 68.470 Statistics for Business Chapter 6: Theory and Methods of Statistical Sampling 1. data clusters around the mean.29% 39.81% Larger the sample size the data clusters around the mean That the central limit theory applies and the sample means follow a normal distribution 2. Solutions – credit card Question 1 2 3 4 5 Answer 27.81% 1. Solutions – food bags Question 1 2 3 4 5 Answer 26. In Questions 2 and 4.97% 9.68% 88.25% 22.87% 38.76% 71.27% 49.38% With larger sample sizes data lies closer to population mean . the percentage increases as the limits are around the mean.

Solutions – soft drinks machine Question 1 2 3 4 5 Answer 31.35% 33.17% 34.45% 5.36 cl.11% 32.36 cl 1.76 minutes 23. 0. 4. For Questions 4 to 6 the limit values are on either side of the mean values 6.69% of the time 33.06% 98.04% 76.79% 30.Appendix IV: Answers to end of chapter exercises 471 4.14% 99.91% With larger sample sizes data lies closer to population mean For Questions 1 to 3 limit values are on the right side of the mean value.48 cl 5.93% 0. Solutions – financial advisor Question 1 2 3 4 5 6 7 8 Answer 42.15 minutes As the sample size increases there are the probability of being beyond 37 minutes declines as more values are clustered around the mean value The population from which the samples are drawn follows a normal distribution .15 cl. Solutions – baking bread Question 1 2 3 4 5 6 7 8 Answer 11.97% 0.94 minutes 18.

29% €450.07% 163.22% 9.67 cm 0. Note.53% 81.83% 66.28% 89. Solutions – Wal-Mart Question 1 2 Answer 11.5250 87.472 Statistics for Business 7.48% 151.22 cm With increase in sample size data lies closer to the mean value 8. Solutions – automobile salvage Question 1 2 3 Answer €10.86% 75. Solutions – height of adult males Question 1 2 3 4 5 6 7 Answer 5.00 10. the closer is the data clustered around the population proportion.34 cm About 0% 167. Solutions – education and demographics Question 1 2 3 4 5 6 7 Answer 65.33 and 200.66 and 188.45% 75. that in each case the percentages given are on either side of the population proportion .78 and 184.07% 82.35% The larger the sample size.

70% 44.77% (Continued) .78% 84.74% 61.39% 84.11% Larger the sample size.Appendix IV: Answers to end of chapter exercises 473 11. the closer is the data clustered around the population proportion.03% 33. that in each case the percentages given are on either side of the population proportion 12. The sample taken would be biased if it was meant to represent the whole of Turkey 13. Note.63% 68. Istanbul is the commercial area of Turkey and the people are educated. Solutions – World Trade Organization Question 1 2 3 4 5 6 7 Answer 56.26% 78.84% 33. Solutions – female illiteracy Question 1 2 3 4 5 6 7 8 Answer 88. Solutions – unemployment Question 1 2 3 4 5 6 Answer 4.27% 1.43% 95.58% 62.71% 68. the data clusters around the population mean No.52% 45.37% The larger the sample size.16% 86.59% 71.

Solutions – homicide Question 1 2 3 4 5 Answer 0. Solutions – manufacturing employment Question 1 2 3 4 5 6 7 8 Answer 68. Perhaps a more palatable reason could be that the sample was taken from an industrial area of Germany and thus the sample is biased 15.56% 94.65% 34. the closer the values are to the mean .4 or 41. you have a 30 times more chance of being shot in Jamaica than in Britain 0.23% The larger the sample size.64% 43.93% The larger the sample size.474 Statistics for Business 13.61% 31.66% 84.002%. the more the data is clustered around the mean You might say that the German data for the population is an underestimation.06% 0.02% As the limits given are on either side of the population proportion the percentages increase with sample size The limits for France are on the left side of the distribution and as the sample size increases the percentage here declines 14.3 or 30.07% 99. Solutions – (Continued) Question 7 8 9 10 Answer 29.39% 0.

82 and €40. upper limit £12. 552.33 €39.55.47.32 g The higher the confidence intervals. the broader are the limits 4. broader the range 3.355. 652. Solutions – households Question 1 2 3 4 Answer Lower limit £11.74 I estimate that the mean life of these light bulbs is about 473 hours (473.63 The higher the confidence level.71 and 503.53 Lower limit £11. the wider are the limits .67 and €40. the broader are the limits About 3 cases (between 61 and 62 cases) A case of 20 bottles may not be random since they probably come off the production line in sequence 2. Solutions – ketchup Question 1 2 3 4 5 Answer 496. 577.644.02 Higher the certainty. Solutions – light bulbs Question 1 2 3 4 5 Answer 394.045. Solutions – ski magazine Question 1 2 3 Answer €39. upper limit £12.Appendix IV: Answers to end of chapter exercises 475 Chapter 7: Estimating Population Characteristics 1.28. upper limit £12.45 Lower limit £11.18 The higher the confidence level.29 g 495.18.18) and 553 (552.65 294.334.68 and 504.91.46) and I am 80% confident that the mean life will lie in the range 394 (394.37.74) hours 369.

246 More confident you are about information (closer to 100%) the wider is the interval 6.245.310.66 million $6.837.792.970.274.57 million $4.32 million $32.520 I estimate that the mean value of the tax returns for the year in question is $20.64 and upper limit is 16.824.53 million Higher the certainty.82 and $50.47 million (Continued) . broader the range $38. the broader are the limits 8.375.48 and upper limit is 35.954 and $31. Solutions – vines Question 1 2 3 Answer Lower limit is 13. Solutions – lemons Question 1 2 3 4 Answer 91. lower limit is 5.954. Solutions – taxes Question 1 2 3 Answer At 80%.62 and upper limit is 16.96 and upper limit is 27. At 95%.61 and $51.62.12 85.476 Statistics for Business 5.36 bunches.85 million $24.04.38 and upper limit is 31.589. lower limit is 13.36 I estimate that in my vineyard this year I will have an average of 15 grape bunches per vine and I am 95% confident that this number will lie between 13.85 and $53.600.52 mg 86.923.614.38 7.29 and 97. lower limit is 9.954. lower limit is 13.12 and 96.088.820.75 The higher the confidence level.00 and I am 95% confident that they will lie between $9. At 99%. Not really.83 million $32.64 and 16. Solutions – world’s largest companies Question 1 2 3 4 5 6 7 8 Answer $41.276..

571.634.00 €136. In Questions 6 to 9 n/N is 3% so the finite population multiplier is not necessary.45 using z My best estimate of the total value of the tyres in inventory is €155.4175 and 3.083.337. For the first four questions we have a sample size of 35 and a population of 500.083 and I am 95% confident that the value lies between €136.635 €130. Solutions – (Continued) Question 9 10 11 Answer $19. €131.0910 2.5090 and 3.861.73 and $58.59 million Higher the certainty. the mean values determined are probably closer to the population value.86 using Student-t. Solutions – stuffed animals Question 1 2 Answer 0.1485 2.305. Solutions – hotel accounts Question 1 2 3 4 5 Answer 0.081.712.1825 The higher the confidence level. €137. However.5558 and 3. Just selecting tyres from the racks at random may not result in a representative sample 11.74 using z The higher the confidence level.447. Solutions – automobile tyres Question 1 2 3 4 5 6 Answer €155.1175 $3.531. which gives a value of n/N of 7% so we should use the finite correction multiplier. broader the range For Questions 1 to 4 we have used a larger sample size than for Questions 6 to 9 and by the central limit theory. the broader are the limits Use random number related to the database of the tyres in inventory. we have used a Student-t distribution as the sample size is less than 30 9.50 using Student-t. the broader is the range 10.93 and €178.532 and €173.50 (Continued) .81 and €173.0442 2.718.Appendix IV: Answers to end of chapter exercises 477 8.17 and €180.085.22 and €172.

Solutions – night shift Question 1 2 3 4 5 6 Answer 0. Lower limit is 0.0113 0.00% 26.056 54.7802 (Continued) .4726. broader the limits 12. upper limit is 0.645.23% and 73.88 and $3.78 Higher the confidence.621.779.40 or 40% Proportion who say yes: 0.22 and $3.70 or 70.77% 66. Solutions – ski trip Question 1 2 3 Answer 0.6000.4198. upper limit is 0.77% The higher the confidence limits.18% 26.0050 and 0.12 $3.18% 66.23% and 33.0158 0. Solutions – shampoo bottles Question 1 2 3 4 5 6 Answer 0.478 Statistics for Business 11. Lower limit is 0.119 The conservative value requires a very high inspection quantity but since the inspection process is automatic perhaps this is not a problem.82% and 73. Note that this is not a random inspection as the device samples every bottle 13.7274 Proportion who say yes: 0.82% and 33.0177 27.803.0068 and 0. Solutions – (Continued) Question 3 4 5 Answer $3. the broader is the range 14.6000.

691 3. Solutions – Hilton hotels Question 1 2 3 4 5 6 7 Answer 0. the data that is taken can be distorted . Solutions – (Continued) Question 4 5 6 Answer The higher the level of confidence.383 15.13% The higher the level of confidence.12% and upper limit is 76.6531 or 65. the broader are the limits 1.31% Lower limit is 54. the broader are the limits 68 136 There will be hotels in the Southern and Northern hemisphere and since hotel occupancy is seasonal.49% Lower limit is 49.49% and upper limit is 81.Appendix IV: Answers to end of chapter exercises 479 14.

68% which is greater than 2.6771 which is inside the critical boundary limits. Alternatively. The critical value of z is 1.5%.8762.8762. Solutions – neon lights Question 1 2 3 4 5 Answer No.6449 and the test value of z is 1.73 and upper value is 1. The p-value is 4.9600 and the test value of z is boundaries.5% in each tail Yes. This is a two-tail test with 2. the total p-value is 6.480 Statistics for Business Chapter 8: Hypothesis Testing for a Single Population 1.000 g is not contained within these limits The firms should not be under filling the bags as this would not be abiding by the net weight value printed on the label. Alternatively the p-value total is 9.012. Accept the null hypothesis.000 kg bags of sugar. This is a two-tail test with 2.000. Solutions – graphite lead Question 1 Answer No.0%. In this case it appears that the bags are overfilled and this is costing Béghin Say money. At €0.5% Alternatively. Critical value of z is 1. This is a two-tail test with 2. This is equivalent to 6. The p-value in each tail is 3.9600 and the test value of z is 1. The target value of 1.6771.03% which is more than 2. which is outside the critical boundaries.000/year from this production line 2. Critical value of z is 5. Critical value of z is 1.9600 and the test value of z is 1.50/bag this would be a loss of an estimated €3. which is within the critical Yes.5% in each tail (Continued) . the total p-value is 6.06% which is less than the α-value of 10% Lower value is 1.27.0% in the tail 1.8396. This is a one-tail test with Yes.0% in each tail Yes.0% The purchaser can refuse the purchase and go to another supplier.000 g is contained within these limits Yes.03% which is less than 5.68% which is less than 5. As an illustration assume that 1 million of these bags are sold per year and they in fact contain 6 g more of sugar.00% Yes.35% which is greater than the value of 5.6449 and the test value of z is 1.74 and upper value is 1. Thus test value is within the boundaries of the critical value. The p-value in each tail is 4.5% in each tail 1. This is a two-tail test with 5. Solutions – sugar Question 1 2 3 4 5 6 7 Answer No. Alternatively the purchaser can negotiate a lower price 3.011. The target value of 1. Critical value of z is 1.26. The p-value in each tail is 3.06% which is less than the α-value of 5% Lower value is 999.

00% €3. The p-value is 5.8396 and z at 5% is 1. However note that at the 5% level the results between the test and critical value are close 5.5% we accept the null hypothesis Lower limit is 0.09% which is less than 10.5715. which is greater than 1.70 and €3.7168 mm is greater than the hypothesized value of 0.0000. Note this is a two-tail test with 2. This is a one-tail test with Yes.09% which is greater than 5.7018. Critical value of z is 10.6989.0% in the tail.29%. The hypothesis value of 0. The test value of z is 1.7493 and sample test statistic is This gives a p-value of 5.0% No change in the conclusions.29% and since this is less than the value of 5% we reject the null hypothesis. Test value is outside boundaries of the critical value p-value is 3. The hypothesis value of 0.269. upper limit is 0. Critical value of z is 1.00 is not contained within these limits (Continued) .7000 mm is not within these limits Since sample mean of 0.9600.7347.7318.200. This is a two-tail test with 5. which is greater than 5% but less than 10% 1. This is a one-tail test with Yes.6366.5% in each tail The p-value is 2.2816 at 10% 4. Solutions – (Continued) Question 2 3 4 5 6 7 Answer p-value is 3. Solutions – industrial pumps Question 1 2 3 4 5 6 Answer No. Sample mean is now 99. The p-value is 5.7000 mm is contained within these limits Yes. Test statistic is 1. upper limit is 0.200.2816 and the test value of z is 1. At both 5% and 10% limits the null hypothesis should be rejected as test statistic is greater. Reject the null hypothesis. Solutions – automatic teller machine Question 1 2 3 Answer Yes.Appendix IV: Answers to end of chapter exercises 481 3. one-tail test.55% which is less than 5. The critical values are z 1.8154 1.7000 mm it implies that the diameter is on the high side.30.0% but only just Yes. The value of €3. Since this is greater than 2. Sample mean is 99.28% which is less than 2.50%.6449 and 1. Alternatively we can say that the p-value is 4.0% in each tail Lower limit is 0.80%.9600 and the sample test statistic is 2. The hypothesis testing could be performed with a smaller sample size which would be less expensive.0% in the tail 1.8396 and the critical value of z is now 1. Thus we would use a right-hand.6449.6449 and the test value of z is 5.6366.

This is a two-tail test with Yes.50% in each tail The p-value is 2.6955 and we accept the null hypothesis.4142 or inside the critical 7.37 and 1. The nominal value of 1.5% in each tail 1. we can say the p-value is 4.5% 996.6449 and the test value of z is 5. This is a one-tail test with 7 . At a significance level of 10% the Student-t value is 1.9600 and the test value of z is 1.28% which is greater than 0.280.50% for one tail. Thus there is no change to the conclusions from Questions 1 to 4 1.0% in the tail Yes.2816 and the test value of z is limit. if the bank puts significantly too much cash in the machines that is underutilized this money could be invested elsewhere to earn interest 6.5758. Critical value of z is 2.0% Yes. Note this is a two-tail test with 0.92 and €3. Critical value of z is 1.189.55% which is greater than 1.54 cm. one-tail test with 5.0% From the data.4142 or outside the critical 1.6680.3095 and we reject the null hypothesis. This is a left-hand. Solutions – (Continued) Question 4 5 6 7 Answer No.86% which is less than 10. The critical value of z is 1. The p-value is 7. The p-value is 4. The p-value is 4.08. one-tail test with 10. Alternatively.00% €3.6449 and the test value of z is limit.77% which is greater than 2. Alternatively.0% If as the test suggests there is significantly less than the 1.000.000 ml as indicated on the label. This is a left-hand. Solutions – salad dressing Question 1 2 3 4 5 6 Answer No.200 is contained within these limits If the bank has significantly less in the machines there is a risk of the machines running out of cash and this is poor customer service.000 ml is contained within these intervals Yes.77% which is less than 5.67%.0% in the tail Yes.3937 and a p-value of 8. The sample test value is still 2.6680.0000 but the critical values are z 2.29 ml. Solutions – bar stools Question 1 2 3 4 5 Answer No. The p-value is 7.86% which is greater than 5. the sample standard deviation is 2. which gives a sample statistic of 1. the manufacturer is not being honest to the customer and they could be in trouble with the control inspectors Very sensitive as small changes in volume will swing the results either side of the barrier limits 1. At a significance level of 5% the Student-t value is 1. The value of €3. The critical value of z is 1.0% in the left-hand tail Yes. The test value is within these boundaries.482 Statistics for Business 5.

0% 1. This is a one-tail test with .7613 and the test value of t is 5.0639 and the test value of t is 1.68 minutes.7109 and the test value of z is 5.79% which is less than 5.92 and 15. Critical value of t is 1.0% 195.58% which is greater than 5.9424.07% and since this is less than the value of 5% we reject the null hypothesis In Questions 1 and 2 we are testing to see if there is a difference thus the value of α is 5% but there is only 2. Solutions – hospital emergency Question 1 2 3 4 5 6 Answer No.1448.39% which is greater than 5. Accept the null hypothesis. Thus test value is within the boundaries of the critical value p-value is 6.5% in each tail which gives a relatively high value of t.15% and since this is greater than the value of 5% we accept the null hypothesis Yes.0330 and the critical value of t is now 1.1448 and the test value of t is 2. In Questions 3 and 4. This is a two-tail test with Yes. Accept the alternative hypothesis.0859.0% in the left-hand tail Yes.0% in the right-hand tail Yes. the hospital is not meeting its objectives with the risk to the life of patients 2. Thus t is smaller than for the two-tail test and explains the shift from acceptance to rejection 10. If this is the case. we are still testing at the 5% significance level but all of this area is the right tail.7613. The p-value is 3.0859. The test value of t is 2.Appendix IV: Answers to end of chapter exercises 483 8.15 g.05 and 200. The p-value is 5. This is a two-tail test with Yes.0% 9. Solutions – apples Question 1 2 3 4 5 Answer No. Critical value of t is 2. Critical value of t is 1. Thus test value is outside the boundaries of the critical value – it is higher p-value is 3. This is a one-tail test with 9. Critical value of t is 2.0% The test for being greater than 10 minutes. The value of 10 minutes is contained within these intervals Yes.5% in each tail 2. Solutions – batteries Question 1 2 3 4 5 Answer No.9424. The test value of t is 2.5% in each tail 2.0330 and the critical value of t is 2.20% which is less than 5. The value of 200 g is contained within these intervals Yes. The p-value is 6. The p-value is 2.

7711 2 3 4 5 6 7 In each tail the p-value is 3. Alternatively the p-value total is 1. Accept the null hypothesis. The value of 34.81% and 33.0% in each tail 1. The value of 45% is within these limits so verifies that we should accept the null hypothesis Reject the null hypothesis.06% and 42.5%.5% in each tail 1.69% which is greater than the 0. Alternatively the p-value total is 7. Solutions – equality for women Question 1 2 3 4 5 6 7 Answer Accept the null hypothesis.5% in each tail In each tail the p-value is 37.45% and 45. This is a two-tail test with 2.5%.5%. Alternatively the p-value total is 75.28% and 35.19%. which is less than the 5. At the significance level of 5% indications are that the pay gap is more than 45% 12.5%.83% which is greater than 2.7711 which is inside the critical boundaries.20%.02% is not within these limits so verifies that we should reject the null hypothesis From the table proportion Poland imports from Russia is 86. This is a two-tail test with 2. This is a two-tail test with 0.484 Statistics for Business 11. Solutions – gas from Russia Question 1 Answer From the table proportion in Italy imports from Russia is 34.6449 and the test value is which is outside the critical boundaries.0%.9600 and the test value is 1. This is a two-tail test with 2. The critical value of the test is 1. The value of 34.65% which is greater than the α-value of 5% Limits are 4.5% in each tail In each tail the p-value is 3.38% which is greater than the α-value of 1% Limits are 15. This is a two-tail test with 5.66%.83%.65% which is less than the α-value of 10% Limits are 6. The critical value of the test is 1.4637.3074 which is inside the critical boundaries.05%.85% which is greater than the α-value of 5% (Continued) 8 .72%.4637.9600 and the test value is In each tail the p-value is 0. Accept the null hypothesis.5758 and the test value is In each tail the p-value is 0. The critical value of the test is 1. The critical value of the test is 2.81%. The critical value of the test is 2.38% which is less than the α-value of 5% Limits are 19.69% which is less than the 2. Alternatively the p-value total is 7. Alternatively the p-value total is 1.9600 and the test value is 0. The value of 45% is outside these limits and so verifies that we should reject the null hypothesis At the significance level of 1% the test would support the conclusions.5% in each tail 2.93% which is greater than 2.20% is within these limits so verifies that we should accept the null hypothesis Reject the null hypothesis.

0% is within these limits so verifies that we should accept the null hypothesis Reject the null hypothesis.65% and 42.84% which is greater than the α-value of 1% Limits are 16. The value of 86. The critical value of the test is 1.5% in each tail In each tail the p-value is 1. Alternatively the p-value total is 75.93% which is greater than 5.98%.0%. It perhaps can easily call on imports from say Libya or other North African countries. The value of 86.2546 which is inside the critical boundaries. The value of 11.5%.3074 which is inside the critical boundaries. Solutions – international education Question 1 Answer Accept the published value or the null hypothesis.84% which is less than the α-value of 5% Limits are 19.17%. This is a two-tail test with 0. The critical value of the test is 2.97% and 34. Alternatively the p-value total is 2.5% in each tail 2.0% in each tail In each tail the p-value is 37.9600 and the test value is which is outside the critical boundaries. This is a two-tail test with 2.5%.92% which is greater than 0.42% which is greater than the α-value of 1% Limits are 9.21% which is greater than 0.6449 and the test value is 0.78%. This is a two-tail test with 0. This is perhaps because historically and geographically Poland is closer to Russia and is not yet in a position to exploit other suppliers 13.92%.47%. The value of 19. Alternatively the p-value total is 3.36% and 99. On the other hand Poland seems to be using close to or more than the contractual amount of natural gas from Russia. The critical value of the test is 1.Appendix IV: Answers to end of chapter exercises 485 12. The critical value of the test is 2. Solutions – (Continued) Question 9 10 11 12 13 Answer Limits are 77. This is a two-tail test with 5.16% and 97.5758 and the test value is 2. The value of 19.81% is within these limits so verifies that we should accept the null hypothesis Accept the null hypothesis.5% in each tail In each tail the p-value is 1. Alternatively the p-value total is 3.5758 and the test value is 2.0710 2 3 4 5 6 7 In each tail the p-value is 1.57%.85% which is greater than the α-value of 10% Limits are 79.0710 which is inside the critical boundaries.81% is within these limits so verifies that we should accept the null hypothesis From this data it appears that Italy is tending to use less than the contractual amount of natural gas from Russia.0% is not within these limits so verifies that we should reject the null hypothesis Accept the published value or the null hypothesis.05% and 46. which is less than 2.5% is within these limits so verifies that we should accept the null hypothesis (Continued) 8 9 .5%.

9998 which is outside the critical boundaries. This is a two-tail test with 5.28% which is less than 5.6449 and the test value is 1. This is a two-tail test with 2.28% which is greater than the α-value of 5% Limits are 0.6449 and the test value is 0. Solutions – (Continued) Question 10 11 12 Answer Reject the null hypothesis.90% is within these limits so verifies that we should accept the null hypothesis Reject the published value or the null hypothesis.55% which is less than the α-value of 5% Limits are 4. The critical value of the test is 1.54%.5%. The critical value of the test is 1.18% and 9.5% in each tail In each tail the p-value is 1.5%.9600 and the test value is 2.5% in each tail In each tail the p-value is 2.5% is not within these limits so verifies that we should reject the null hypothesis 14. Alternatively the p-value total is 97.0%. The value of 4. Alternatively the p-value total is 4.0341 which is inside the critical boundaries.21%. The critical value of the test is 1.42% which is less than the α-value of 5% Limits are 12.90% is within these limits so verifies that we should accept the null hypothesis Accept the published value or the null hypothesis.486 Statistics for Business 13.5%.0% in each tail In each tail the p-value is 2. The value of 11.64% which is greater than 2.28% which is less than 2. Alternatively the p-value total is 2.9998 which is outside the critical boundaries.46%.28% which is greater than the α-value of 10% Limits are 0.55% which is less than the α-value of 10% (Continued) 2 3 4 5 6 7 8 9 10 11 . Alternatively the p-value total is 4. which is less than 2. The critical value of the test is 1.0341 which is inside the critical boundaries.2546 which is outside the critical boundaries. This is a two-tail test with 2.0% in each tail In each tail the p-value is 48.90% and 31. The value of 4.9600 and the test value is 1.0%.5% in each tail In each tail the p-value is 48. The value of 4.28%. This is a two-tail test with 10.92% and 8. Solutions – US employment Question 1 Answer Accept the published value or the null hypothesis. This is a two-tail test with 2. Alternatively the p-value total is 97.64% which is greater than 5.9600 and the test value is 0.72%.99% and 14. The critical value of the test is 1.90% is not within these limits so verifies that we should reject the null hypothesis Reject the published value or the null hypothesis.

People are older and do not have flexibility to move from one industry to another 15.54%.37%.00% is not within these limits so verifies that we should reject the null hypothesis Reject the published value or the null hypothesis. This is a two-tail test with 2.90% is not within these limits so verifies that we should reject the null hypothesis The data for Palo Alto indicates low levels of unemployment.22% and 29.6449 and the test value is 1.0%. Alternatively the p-value total is 0% which is less than the α-value of 5% Limits indicate 9.7059 which is well outside the critical boundaries. The critical value of the test is 1.5%. The critical value of the test is 1. Solutions – (Continued) Question 12 13 Answer Limits are 5.5% in each tail In each tail the p-value is 2. This is a two-tail test with 5.6449 and the test value is 8. The value of 4.81% which is less than 5.81% which is greater than 2. This is a two-tail test with 10.74% and 13.9600 and the test value is 8.30%.29% and 31.23%. The value of 31. This region is a service economy principally in the hi-tech sector.0% in each tail In each tail the p-value is 2. The critical value of the test is 1.5% in each tail In each tail the p-value is 0% which is less than 2. The value of 31.5%. Solutions – Mexico and the United States Question 1 2 Answer 0.Appendix IV: Answers to end of chapter exercises 487 14. The value of 60% is far from being within these limits so verifies that we should soundly reject the null hypothesis Reject the published value or the null hypothesis.9102 which is inside the critical boundaries.0% in each tail (Continued) 3 4 5 6 7 8 9 10 11 .61% which is greater than the α-value of 5% Limits are 7.9102 which is outside the critical boundaries. Detroit is manufacturing and this sector is in decline. Alternatively the p-value total is 5. The critical value of the test is 1. This is a two-tail test with 2.61% which is less than the α-value of 10% Limits are 9.21% (0%) and 16.00% is within these limits so verifies that we should accept the null hypothesis Reject the indicated value or the null hypothesis.9600 and the test value is 1.7059 which is well outside the critical boundaries.10% Accept the indicated value or the null hypothesis. Alternatively the p-value total is 5. People are young and are more able to move from one company to another.

and undocumented alien is unlikely to be truthful and nothing says that the random sample were documented. To an interviewer. Taking a random sample of foreign-born individuals would be biased. or undocumented. Solutions – (Continued) Question 12 13 14 Answer In each tail the p-value is 0% which is less than 5. Exercise underscores difficulty of obtaining reliable information about illegal immigrants .16% (0%) and 14.18%.488 Statistics for Business 15.0%. The value of 60% is not within these limits so verifies that we should reject the null hypothesis Random sampling for verifying that an individual is from Mexico can only be done from census data. Alternatively the p-value total is 0% which is less than the α-value of 10% Limits indicate 7.

88% 3. The conclusions do not change.00% in the tail.8816 Significance level is 5% and the p-value is still 0.3733.8816 μ2.5758 and the sample test statistic is 2. the critical value of z is 2. Solutions – tee shirts Question 1 2 3 4 5 6 7 8 Answer Null hypothesis is H0: μS μI. This is a one-tail test Yes. At 5% significance. At 1% significance.1979% No.9600 and the Significance level is 1% with 1. At 5% significance. where μS is the mean value for Spain and μI is the mean value for Italy. the critical value of z is Accept the null hypothesis 2. At 2% significance. The p-value is still 0. Alternative hypothesis is H1: μS μI.1979% In 2006 the price of crude oil rose above $70/barrel. Thus.88% Yes. where μS is the mean value for Spain and μI is the mean value for Italy.4320 (Continued) . where μ1 is the mean value 2. the critical value of Student-t is 2. Alternative hypothesis is H1: μS μI. where μF is the mean value for stores using FAX and μI is the mean value for those using Internet. The conclusions change. At 1% significance.4999 and the sample test statistic t is 2.3733. The p-value is still 0. Reject the null hypothesis Significance level is 5% with 2. Solutions – inventory levels Question 1 2 Answer Null hypothesis is H0: μF μI. At 1% significance. this is a one-tail test No. the critical value of z is Reject the null hypothesis 2. Alternative hypothesis is H1: μF μI.88% Null hypothesis is H0: μS μI. Gasoline is a refined product from crude oil 2.9600.50% in each tail. 1.3733.3263 and the sample test statistic is 2. This is a two-tail test No. Alternative hypothesis is H1: μ1 in January 2006 and μ2 is the mean value in June 2006 Yes. The sample test statistic remains at 2.Appendix IV: Answers to end of chapter exercises 489 Chapter 9: Hypothesis Testing for Different Populations 1. Significance level is 1% with 0.3263 and the sample test statistic is Significance level is 2% and the p-value is 0. the critical value of z now becomes 1.5% in each tail. the critical value of z is now sample test statistic is still 2. Solutions – gasoline prices Question 1 2 3 4 5 6 Answer Null hypothesis is H0: μ1 μ2. The p-value in each tail is 0.

3436.8125 (Continued) . Perhaps FAX orders sit in an in-tray before they are handled.9037. Solutions – (Continued) Question 3 4 5 6 Answer Significance level is 1% and the p-value is 1. Alternative hypothesis is H0: μD μS 1.16% The statistical evidence implies that when orders are made by FAX. However Internet orders are normally processed immediately as they would go into the supplier’s database 4. Since we are asking is there a difference.000/store.7404 and critical Student-t is 2. those stores using FAX keep a higher coverage of inventory in order to minimize the risk of a stockout.5% in the tail and the sample p-value is 0. accept the null hypothesis.000/store Null hypothesis H0: μ1 μ2 £26. the delivery is longer.7404 but critical Student-t is now 1. This is a one-tail test Significance level is 1% and the sample p-value is 0.8784 and the sample test statistic t is 4.87% (a one-tail test) The investment is too much to install a database system.490 Statistics for Business 3.7638 Significance level is 1% and p-value is 1. For this reason. Thus benchmark is an increase of 10% or £26.000 No. Sample Student-t is still 2. Solutions – sales revenues Question 1 2 3 4 5 Answer Average store sales of those in the pilot programme before is £260.04% H0: μD μS 1. At a 1% significance level. Significance level is 1% with 0. the critical value of Student-t is 1.04% Yes. this is a two-tail test No. It could be that they prefer the manual system as they can more easily camouflage real revenues for tax purposes! 5. This is now a one-tail test 2 3 4 5 6 7 Using this new criterion we reject the null hypothesis.000. The critical value of Student-t is now 2. Alternative hypothesis H1: μ1 μ2 £26.7139 and the sample test statistic t is still 2.5524 and the sample test statistic t is still 2.4320 Significance level is 5% and the p-value is still 1. Alternative hypothesis is H0: μD μS. At 5% significance. Sample Student-t is 2.25 μS.25 μS. The critical value of Student-t is 2. where μD is the mean value using database ordering and μS is the mean value for standard method. Solutions – restaurant ordering Question 1 Answer Null hypothesis is H0: μD μS.16% Yes.

3509 and critical Student-t is 1. Sample Student-t is 1.1081 Significance level is 15% and p-value is still 13.8214 Significance level is 1% and sample p-value is 36. note that a high significance level is used before we can say that the objectives have been reached 7.1959 but critical Student-t is now 1. Also eliminating coffee may reduce or increase stress or may reduce or increase consumption of sugar so all variables have not been isolated 7 . Alternative hypothesis H1: μ1 μ2 12.20. Sample Student-t is still 0. Sample Student-t is 0.3509 and critical Student-t is 2.8965 Significance level is 1% and p-value is 13.30% That depends if these new yield rate levels are sustainable and if the advertising cost and the reduction in price levels have led to a net income increase for the hotels. Sample Student-t is still 1. Alternative hypothesis H1: μ1 μ2 10% No.7 headaches per month. Null hypothesis H0: μ1 μ2 12.69% About a 60% reduction would be required.69% No.8257 and a p-value of 0.Appendix IV: Answers to end of chapter exercises 491 5.99% Patients may experience less attacks knowing they are undertaking a test (physiological effect). Solutions – (Continued) Question 6 7 Answer Significance level is 5% and p-value is 1. At a 60.04% Even at 1% significance the results are very close to the critical test level.4 and 50% of this is 12. Solutions – hotel yield rate Question 1 2 3 4 5 6 Answer Null hypothesis H0: μ1 μ2 10%.3830 Significance level is 10% and sample p-value is still 36. Retail sales are very sensitive to seasons so we need to be sure that the sampling data was carried out in like periods 6.20.30% Yes. This gives a sample Student-t of 2. Solutions – migraine headaches Question 1 2 3 4 5 6 Answer Monthly average of the migraine attacks before is 24.28% reduction this gives a mean value afterwards of 9.20 No. Also.1959 and critical Student-t is 2.

3263 0. Alternative hypothesis is H0: p05 year the test was made. The indices refer to the (Continued) . Solutions – World Cup Question 1 Answer Null hypothesis is H0: p98 p06. Alternative hypothesis is H0: p6 the test was made. The fitness craze seems to have leveled off in this decade. Sample z-value is 1.6590 and critical z-value is 1. Sample z-value is still 2. Sample z-value is still 1. This is a two-tail test p06. The indices refer to the year Yes. The indices refer to the Reject the null hypothesis.5758 Significance level is 1% with 0. This is a one-tail. This is a one-tail test p96. Solutions – flight delays Question 1 2 3 4 5 6 7 8 9 Answer Null hypothesis is H0: p05 p96. Sample z gives 4. Other variables that need to be considered in the sampling are the gender of customers.492 Statistics for Business 8.9600 Significance level is 5% with 2.4972 and critical z-value is 1. Sample z gives 0.6590 and critical z-value is now 1. Accept null hypothesis.63% in each tail Yes.5% in each tail.5% in each tail. Certainly though airplane traffic has increased significantly and delays have increased. In summer traffic is heavier.86% in the tail The data results are close. and whether the customers at the hotel are for business or pleasure? 9. there are weather problems.63% or the p-value of the sampling experiment The experiment could be defined better. The indices refer to the year p1.4972 and critical z-value is 2. Alternative hypothesis is H0: p6 the test was made.5% in each tail.9600 Significance level is 5% with 2. Alternative hypothesis is H0: p05 year the test was made. right-hand test p1. Reject null hypothesis. This is a two-tail test No.86% in each tail Null hypothesis is H0: p6 p1. What is a major European airport? When was the sampling made – winter or summer? In winter. This is a two-tail test p96.63% in each tail Null hypothesis is H0: p05 p96. Alternative hypothesis is H0: p98 year the test was made.4972 and critical z-value is 2. The indices refer to the No. Sample z-value is 2. Solutions – hotel customers Question 1 2 3 4 5 6 7 Answer Null hypothesis is H0: p6 p1.6449 Significance level is 5%. Sample z gives 4. Sample z-value is 2. Sample z gives 0. A 20-minute delay may be excessive for some passengers 10.

74% in each tail Null hypothesis is H0: p06 p98.4385 and critical z-value is 1.85% Yes.2767 and sample chi-square value is 9. Note.74% The experiment should be defined. Alternative hypothesis is H0: p06 p98.5% in each tail.5758 Significance level is 1% with 0. We could also say.Appendix IV: Answers to end of chapter exercises 493 10. Sample z gives 0. Critical chi-square value is 13.5% in each tail. Percentage returns are 93% (186/200) Stress may not only be a result of travel but could be personal reasons or work itself 12. Not all European countries have a team in the World Cup 7 8 9 11. It seems that the investment strategy of individuals is independent of the country of residence. What was the gender of the people sample? (Men are more likely to be enthusiastic about the World Cup than omen.) The country surveyed is also important. Solutions – (Continued) Question 2 3 4 5 6 Answer No. Sample chi-square is 10. H0: p98 p06.4385 and critical z-value is 2. Alternative hypothesis H1: pHS indices refer to the stress level pMS pI pLS where the No. Alternative hypothesis H1: pU indices refer to the first letter of the country pG pI pE where the No.5604 Significance level is 5% and p-value is 4.3263 Significance level is 1.5604 Significance level is 1% and p-value is 4.9108 and critical chi-square is 11. Solutions – travel time and stress Question 1 2 3 4 5 6 7 Answer Null hypothesis H0: pHS pMS pLS. Reject null hypothesis.00% and p-value is 0. Sample z gives 0. Solutions – investing in stocks Question 1 2 3 Answer Null hypothesis H0: pU pG pI pE.85% Yes. This is a one-tail test.9600 Significance level is 5% with 2. Alternative hypothesis is H0: p98 p06 though this does not fit with the way the question is phrased Reject the null hypothesis.74% in each tail Yes. Sample z-value is 2.22% (Continued) .4385 and critical z-value is 2. Critical chi-square value is 9. Sample z-value is still 2. The indices refer to the year the test was made. Accept null hypothesis. Sample z-value is 2.4877 and sample chi-square value is still 9.3449 Significance level is 1% and p-value is 1. here we have reversed the position and put the year 2006 first.

Critical chi-square value is 13. Sample chi-square is 10.22% According to the sample. Solutions – automobile preference Question 1 2 3 4 5 Answer Null hypothesis H0: pG pF indices refer to the country pE pI pS.494 Statistics for Business 12. Alternative H1: pG pF pE pI pS where the Yes. Solutions – (Continued) Question 4 5 6 Answer Yes.00% and p-value is 6.5675 Significance level is 10. It seems that the investment strategy of individuals is dependent on the country of residence.5073 and sample chi-square value is 14. the newspaper that is read.3616 and sample chi-square value is 14.5675 Significance level is 5. Solutions – newspaper reading Question 1 2 3 4 5 6 Answer Null hypothesis H0: p1 p2 p3 p4 indices refer to the salary category p5. Critical chi-square value is 15. Critical chi-square value is 31.9999 and sample chi-square value is 38. nearly 62% of individuals invest in stocks in the United States while in the European countries it is between 50% and 52%.12% the p-value of the sample data Results are not surprising.81% There are many variables in this sample experiment for example the country of readership. The higher level in the United States is to be expected as Americans are usually more risk takers and the social security programmes are not as strong as in Europe 13.9473 Significance level is 3% and p-value is still 1.9108 and critical chi-square is 8. Alternative H1: p1 p2 p3 p4 p5 where the No. Even though we have the European Union people are still nationalistic about certain things including the origin of the products they buy 14. Depending on the purpose of this analysis another sampling experiment should be carried out to take into account these other variables .12% 0.7764 Significance level is 1% and p-value is 0. and the profession of the reader.00% and p-value is 6.81% Yes.

00% since the p-value of the sample is 2. It seems that the amount of wine consumed is independent of the country of residence.63% 3. Sample chi-square is 23. Solutions – wine consumption Question 1 2 3 4 5 6 Answer Null hypothesis H0: pE pF pI pS pU. and California are up – at the expense of French wine! .7418 In Europe wine consumption per capita is down but wine exporters such as South Africa.2170 Significance level is 1% and p-value is 2. Australia. Alternative H1: pE indices refer to the first letter of the country pF pI pS pU where the No.Appendix IV: Answers to end of chapter exercises 495 15.1757 and critical chi-square is 26.63% 22. Chile.

9002x This might correspond to increased purchases due to back-to-school period r2 is 0.9649 A reduction of about 246 deaths per year 4. Coefficient of correlation is 0.000 hours worked. Solutions – safety record Question 2 3 4 5 6 7 Answer y ^ 179. Note that the coefficient of correlation has the same negative sign as the slope of the regression line 0.983 (Continued) 4 5 .000 hours worked 0.000 hours worked 2.496 Statistics for Business Chapter 10: Forecasting and Estimating from Correlated Data 1.4441 0.16 reported incidents per 200.000 hours worked The coefficient of determination. Using the linear regression equation to forecast for 2009 already shows a negative 0. Solutions – office supplies Question 2 3 4 5 6 7 Answer y ^ 17.11 reported incidents per 200.9391 or the coefficient of correlation of 0.6883 0.057. We cannot have negative reported incidents of safety It suggests that the safety record of ExxonMobil has reached its minimum level and if anything will now decline in an exponential fashion.895 incidents per 200. From an operating point of view it may be better to report unit sales in order to see those items which are the major revenue contributors 3.261 and December 2009 is £93.064 Forecasting using revenues has a disadvantage in that it covers up the increases in prices over time.9691. r 2.9388 and r 0.620. Here the value is 0. Solutions – road deaths Question 2 3 Answer y ^ 498.701 increase per quarter June 2007 is £66.9309. This is most reliable as it is closer period to actual data.9689 which indicate a good correlation so there is a reasonable reliability for the regression model £2. r 2. the regression line seems to be a reasonable predictor for future values.49 245. which is 0. December 2008 is £82.0895x An annual reduction of 0.59x using the years as the x-values The coefficient of determination.07 incidents per 200. Since it is quite close to unity.

8627 or coefficient of correlation. 5 is achievable with a lot of effort on the part of drivers and national and local authorities.9393. Clients change their minds 8 .5455x 55. For the data given they are greater than 0.0000 Coefficient of determination is 0.7598x 0. Further. A value of 20 for x is beyond the range of the collected data.8822 and the coefficient of correlation is 0. r 0. Solutions – carbon dioxide Question 1 2 3 4 5 6 7 Answer y ^ 28.406 millions of metric tons of carbon per year Worldwide there is a lot of discussion and action on reducing carbon dioxide emissions.4 minutes 61.Appendix IV: Answers to end of chapter exercises 497 3. and different cooking conditions (Rare. externally many clients would not wait over an hour to get served People order different meals.54 millions of metric tons of carbon per year 2.043 and 2. Kitchen staff not polyvalent so bottlenecks for some dishes. and well cooked steaks).8088 2.9288 2. To be correct we use the Student-t distribution.8 and indicate a reasonable relationship 28. 2.0 minutes. The value of 71 for Question No. r 2 y ^ 5. These values indicate the danger of making forecasts beyond the collected data and way into the future 4. 6 is desirable but probably unlikely. Planning not optimum in the kitchen or in the dining room.8 minutes 22.983 for Question No. Solutions – (Continued) Question 6 7 Answer 71 The value of 4. Solutions – restaurant serving Question 2 3 4 5 6 7 Answer Coefficient of determination. This would be an unacceptable internal operating situation (36% of employees are absent).8 minutes for every employee absent 5.120 millions of metric tons of carbon per year Between 2. so that the rate of increase may decline and thus making forecasts using the given data may be erroneous 5.198 million metric tons of carbon equivalent. medium.256.

r2 ^ be maintained indefinitely 0.536x 11.9156.8681. of Yam ^ parties. r2 is 0.28 * advertising budget 745.9099 14.65 * No. r2 is 0.498 Statistics for Business 6. of sales persons.4 million per year (Continued) .21 2. Solutions – product sales Question 2 3 4 5 Answer Regression equation is y ^ 28 units/day 65 units/day. However this profile cannot 57.6512. relationship is not strong 1.835 and upper limit is £589.742.7849x. Solutions – German train usage Question 2 4 5 6 Answer Regression equation is.6319) y 0.2909x 8. r2 0.944.649 Lower limit is £381. Coefficient of multiple determination r2 is 0. y ^ Regression equation is.7829 3. Coefficient of ^ multiple determination r2 is 0.40 * No.799 and upper limit is £588.171 9.463 y 205.2459 2.9650.21 2.9640 which indicates a strong relationship £485. Solutions – hotel revenues Question 1 2 3 Answer y ^ r2 7. Is it realistic for the store? The extent of product exposure has an impact on sales 10. relationship is strong 8.981.987. The relationship is strong 7.9483 a strong relationship. a strong relationship £477. The shelf space of 5 m2 is beyond the range of the data collected.79 * sales persons 379. Solutions – cosmetics Question 1 2 3 4 5 6 Answer y 230.0706x2 0.31 * advertising budget 732.400x 0.5054.60 $7.485 Lower limit is £366.5570x. y ^ About 21 million (20.

the more difficult it is to make a forecast 10.8942.248.6424 €348.55 million 428.047.6727x ^ 224.387.11 million (Continued) . Hershey is expanding their market into Asia so that it should help to keep up their revenues 11.230.090 $79.9522.61 and $100.90 million 2008 is closer to the period when the data was measured and so is more reliable. Since it is greater than 0. Lower limit €191. r2 is 0.85 million €687.8 the regression line is a reasonable predictor for future values $2.193.70 million $205.Appendix IV: Answers to end of chapter exercises 499 9.9275 273. Solutions – (Continued) Question 4 5 6 7 8 9 10 Answer $116. Hershey has been around for a long time.70 million and upper 130. Solutions – Hershey Question 1 3 4 5 6 7 8 9 Answer The number of shares more than doubles in mid-1996 and mid-2004 indicating a stock split in those periods y ^ 732. r2 0.94 A severe economic downturn which could depress stock prices Probably not significantly reduced.012. People will continue to eat chocolate and related foods. Lower limit $102. The further away in the time horizon that forecasts are made.930.29 million y ^ 1.931/year $90.40 million. r 2.6182.9722 3.1091. Solutions – compact discs Question 2 3 4 5 6 Answer y 15.9712x 3.053.60 million. Here the value is 0.7610 The coefficient of determination.50 million and upper 219.7705x 4.7652x2 0.7656x 5169. closer to unity 31.028.0114x2 4.30 million y ^ r2 0.

Thus using linear regression based on earlier period data is not useful.264. In fact sales of CDs are now declining with Internet downloading of music and further.855. y ^ 1965–1969.40x 134.488.573. Also.8368 4 5 6 7 8 9 10 0.861.292.721. Even though the coefficients of regression determined in Question 2 show a strong correlation for the time series data. 12.40.733. The curve developed using the polynomial function indicates a close fit and this is corroborated by the coefficient of determination. y ^ 1995–1999.860.829. exponential: 3. will soon be replaced by other technologies.99 million. r2 0. Note the value obtained depends on how many decimal places are used for the coefficient.088. forecast for 2006 is $61.00.023. r2 0.60 million Using the actual data for 1965–1969.50x 2. forecast for 2006 is $1. forecast for 2006 is $962. 12.638. y ^ 1. From this we might conclude that the forecast developed from this relation for 2010 is the most reliable .6840x 64.080. p.20.3373. there is evidence that the CD.9954 35.0000e0. Solutions – US imports Question 2 Answer 1960–1964.832 $2.609.10. Answers to Question 6 and 7 use eight decimal places Using polynomial relationships implies that we can reach infinite values. r2 1.885. forecast for 2006 is $1.9767 3 Using the actual data for 1960–1964.9700 38. Solutions – (Continued) Question 7 8 Answer 844.675.238.039. r2 0.737.7090x 0.60 million Using the actual data for 1995–1999.034.791.60x 350. 17 August 2007.998.057.40. you must know the business environment – in this case there has been a significant change in the volume of US imports since 1960 1960 2006. only using the data for 2002–2005 is a close agreement (a difference of some 1%). This shows that the closer your actual data is to the forecast there is a better reliability. y ^ 1985–1989.8000.941.8297x2 18. y ^ 1975–1979.457 million. r2 0. polynomial: 284. y ^ 2002–2005.1 1 At 25.9758 175. forecast for 2006 is $169. The answer to Question 7 is almost twice that of Question 4.00 million Using the actual data for 1985–1989. introduced in 1982.380 million. which is close to unity.298.1087x. International Herald Tribune.710. 270.250.9158 3.449.00 million The actual value for 2006 is $1.9935 Linear: $1.159. r2 0.599.9973 68.50.416. y ^ y ^ y ^ 32.80.563.20x 7. the slope of the curve increases. compact disc faces retirement.5400x3 28. r2 0.345.500 Statistics for Business 11.763. r2 0.836.20 million Using the actual data for 1975–1979.80x 70.9712 27.988.088.2310x4 9. forecast for 2006 is $1.80x 55.675 million As the years move on.756.778.799.070. r2 0.581.083.00 million Using the actual data for 2002–2005.472.

Solutions – English pubs Question 1 2 3 Answer There is a seasonal variation with highest sales in the spring and summer which is to be expected. spring 1. Solutions – Mersey Store Question 2 Answer Q1: 14.1427.551. There is a lot of concern about the amount of beer consumed by the English and the hope is that consumption will decline or at least not increase 14. spring 25. summer 117. summer 1.73% above and winter 89.905.7796 364.013.1997. Solutions – swimwear Question 1 2 3 Answer There is a seasonal variation and sales are increasing over time y ^ 6. Q2: 15.514 litre in 2010. summer 2. fall 0. spring 2.544.2587.0873.598.Appendix IV: Answers to end of chapter exercises 501 13. fall 0. summer 1.146.5298.515. spring 97. summer 22.66773.0874 r2 0. If dollar sales are used this masks the impact of price increases over time 4 5 6 .2052.919. Q4: 15.017.4283x 1.6686.9627.830. This means that for 2005 winter sales are 74. fall 0. Q3: 16.200 litre 8. sales are increasing over time y ^ 32.12% below Winter 0.1120 Winter 3.9720.750.021 15.13% below the yearly average. This means that for 2004 winter sales are 80. summer 2.63% above and winter 91. Also.5606.20% above. spring 1.2569.9031 4 5 6 7 Winter 328. spring 52. fall 0.963.1083.7968x Ratios for 2005 are winter 0.9843 Ratios for 2004 are winter 0.302. spring 1.27% below Winter 0.580. fall 149.173.402. spring 1. fall 1.1763. summer 66.555 units Unit sales are more appropriate for comparison.98% above.03% below the yearly average. summer 3.

173. 72. 72. 97. 148. 94. 140. 109. 61. 221. 70. 168. 81. 115. 221. 149 100. Solutions – backlog Question 1 2 3 4 5 Answer 1988 100. 80. 84. 216. 211. In 2005 it was 33% more than in 1996 1987 is a long way back. 222. 93. 88. 101. 133 In 1987 the price was 15% more than in 1996. 113. 89. 93. 89. 106. 1996 100. in 1993 it was 48% more than in 2000. In 2004 the backlog increased by 39% compared to 2003 2. 86. 99. 99. 98. 94. In 1994 the backlog decreased by 5% compared to 1993. In 2001 it was 39% less than in 1987. 81. In 1989 the backlog was 26% more than in 1988. 114. 87. 85. 126. 63. 86. 147. 117. 222. 2000 159. 97. 113. 100. 85. 105. 144. 1996 87. In 1998 the backlog decreased by 12% compared to 1997. In 2005 it was 16% more than in 1987 1988 115. 2000 is closer to the present time and all the index values are less than 200 making interpretation more understandable 1989 126. 139. 82. 95. 115. 101. 100. in 2000 it was 50% more than in 1988. Solutions – gold Question 1 2 3 4 5 6 7 8 Answer 1987 100.502 Statistics for Business Chapter 11: Indexing as a Method of Data Analysis 1. 101 6 7 In 1990 the backlog increased 14% over 1989. 132. 96. 86. 107. 98. 190. 107. 91. in 2005 it was 124% more than in 1988 1988 67. 95. 137. 95. In 1989 the backlog was 33% less than in 2000. 112. 2000 109. 224 150. 66. 147. 92. 126 1997 when the price decreased by 15% from the previous year 2005 when the price increased by 26% from the previous year . in 2005 it was 49% more than in 2000 1988 is a long way back and with a base of 1988 gives some index values greater than 200 which is sometimes difficult to interpret. 91. 84. 85. 158. 116 In 1996 the price was 13% less than in 1987. 63. In 2001 it was 30% less than in 1996. 70. 106. 237. 99. 72. 74. 2000 148. 100. 146. 114. A base of 1996 gives index values for the last 10 years 1988 98. 76. 117. 126. 77. 144. 105.

127. 56.8.0.0.7.9. 138. 310.6. 117.5% more than in 1975.0. 2006 245. 105. 99. 124. 221. 335. then the price of refined gasoline rises.0. 95.2% less than in 2000. 1990 340.6. 223.4. 122. In 1995 it was 19. 78.4. 101.5. 109. 113. Solutions – coffee prices Question 1 Answer 1975 100.7.3.7. In 1993 the price was 10.5.2. 89.6.2. 2000 100.0. 431. 59. However.0. In 1993 the price was 15. 2000 110.0.4.5.0. 359. 89. 113.0. 110. 177.8.9. 132.2. 95.0.3.9. 312. 79.2 100.1.8. 190. 101. 103. 94.5% less than in 1990. 115. 107. there is always a lag when gasoline prices change to correspond to crude oil price changes 4.1 In 1985 the price was 180.0. 119.5.4. 407.0.5. 109.2 In 2005 when there was a 35. 90. 297. 98.2.5. 150. 94. 2000 125.4. 140.3. 78.4.1.0.4% more than in 2000 1990 is in the last century.8. 101.3.5.6. 73.8. Solutions – US gasoline prices Question 1 2 3 4 5 6 7 8 10 Answer 1990 100.1. 84.9. 245.3.1.2. 130. 82. 125.2. 89. 62. 94.0. 306. 142. 1985 280.2 In 1985 the price was 25. 96.9.8% less than in 1990.8 In 1985 the price was 17.5% more than in 1990 or more than double the price 1990 83.2. 2006 116. 90.4.1.9.8.7. 86. 94.8 119. 79.5.7. 127. 100.4. 100.8. 2006 206. 334.6. 97.0 75.7.2% more than in 2000 (Continued) 2 3 4 5 6 . 87. 91. 94.3. 70. In 1995 it was 8.1.9.0. 119.8.4.3.0. In 2005 it was 111.8. This is to be expected.0. There is an approximate correlation between the two in that when crude oil prices rise.3.Appendix IV: Answers to end of chapter exercises 503 3.6. 82.5. 448. 244. In 2004 it was 36. 307. 135.2. In 2005 it was 77. 212.1. 96. 86. 78.0. 111. 71.3. 1995 108.3% less than in 2000. 65. 119. 324.5. 136. 476.8. 1985 74.0. 85. A base of 2000 gives the benchmark start for the 21st century 1991 94.3.8.8. 386. 135. 65.1. 59.9. 386. In 1998 it was 29.8. 84.7.6. 83.5. 66.3.6. In 2004 it was 365.0.6.8. 81.9. 78.0.8. 155. 37. 2000 119.2.1. 40.0. 100. 103. 293. 95.8. 70. 211.5.6. In 1998 it was 15. 2000 124.2.8. 82. In 1995 it was 307.1% less than in 2000.7. 113.1.6. 2004 465.6. 65.1. 81.9.0. 94. 108.7.2.2.3. 265.8.2% more than in 1975.1% more than in 1975 1975 29.9.0. 248.0. 2000 374.9. 98.9% increase over the previous year 1990 100. 95.0. 406. 65.8% more than in 1990 1975 26.4. 417.0.0. 72.3. 2000 114.9. 90. 175.6.2. 85.3.7% less than in 1990. 60.5. 95.3.7% more than in 2000. In 2004 it was 24. 1990 100.3.2.4. 120.8% more than in 1990. 102.7.4. 74.7. 90.

166. 121.5.6% more vehicles were sold than in 1992 Not that great. 107.9. 92. 130. 144.3.3. 126. 1995 105.9% higher than in 1992 1993 2005 108.4% Coffee beans are a commodity and the price is dependent on the weather and the quantity harvested.7. 159. 25. 96. 1985 132. 116. 138.8. 66.8. 93. 2000 104. 99. 2004 106.6.1% 1992 100. 95.3. 124.0.0.9.4. 1998 by 3.9.5% In 1981 when prices decreased by 20. 94.1. 103. dividends are half what they were in 1992.4.5. 120.3. 99. 114. 102. 2000 108. 95.9. 95. 2004 181.9. Solutions – Boeing Question 1 2 3 4 5 6 Answer 100. 93 There was a 7% annual decline in both 2002 and 2003 and a 4% annual increase in 2004 and a 5% annual increase in 2005 6.7. 91. stock price has fallen. 79. 96.8.8.1.4.0. North American vehicle sales look to be on the decline 133. 103. 155. 129.5.0. 160.6. 102.9 They were 81.9. 80.5.3.8. . 104. 93. 118.2. 92.0.504 Statistics for Business 4.8. 107. 114. 108. 139. 104.4.8.3. 114.8. 102.3 104. 84. Also country unrest can have an impact such as in the Ivory Coast in 2005 8 9 10 11 5.3.4. 174.7 In 1977 when prices increased by 121. 117.9. 93.6.0. Solutions – Ford Motor Company Question 1 2 3 4 5 6 7 Answer 1992 100. 119. 221. Solutions – (Continued) Question 7 Answer A base of 2000 is probably the best as it starts the 21st century.2. 108. 104.3. 92. 111.6.2. 115.9.0 2000 when 33. 155.2% and 2001 by 7.8.2.3.0.1.2. 106 They were 6% higher than in 2005 94.8. 90. 81. A base of 1975 is inappropriate as it is too far back and we are introducing percentage values more than 100% which is not always easy to grasp 1976 138. 106.3. 2000 163.5. 111.6.5.0. 98.3. 114.7.8. 100 They were 6% less than in 2001 105. 105.

61. 62. 138. 42. 40. 88. 77. 50. 100. 70. 49. 88 100. In Switzerland there is 29% less people working part time than in the Netherlands Australia 88. France 100. Sweden 300 In Ireland 767% more people than in France admit to have been drunk (almost 8 times as many). Solutions – cost of living Question 1 Answer Amsterdam 46% less. 39. 12. 13. 123. 49. In Greece there is 54% less people working part time. Tokyo 38% more. 169. 188. In Switzerland the proportion of women working part time is 7% less than in Britain 9. Solutions – part-time work Question 1 2 Answer Australia 208. 46. 61. 169. 43. 32. 101. Denmark 100. 196. Vancouver 53% less (Continued) . United States 36 100. 30. 108. Sweden 39 In Ireland 13% more people than in Britain admit to have been drunk. 113. 97. 89. 42. 77. 99. In Greece there is 39% less people working part time. 88. 72.Appendix IV: Answers to end of chapter exercises 505 7. In Greece the figure is about the same as in France and in Germany it is 233% more or over 3 times as much Britain 88. Britain 101. 277. 107. 58. Ireland 162. 867. United States 100 138. 115. New York 18% more. 136. 71. 333. 13. 100. 108. 233. Sydney 35% less. Netherlands 33. 13. In Southern Europe drinking wine with a meal is common and this perhaps suppresses the notion for “binge” drinking 8. 50. 200. 88. Paris 23% less. Berlin 58% less. Solutions – drinking Question 1 2 3 4 Answer Britain 100. 90. 90. In Greece it is about 88% less than in Denmark and in Germany it is 62% less than in Denmark Southern Europeans seem to admit being drunk less than Northern Europeans. 105. 44. 83. 533. 113. 105. In Greece the figure is 87% less than in Britain and in Germany it is 57% less Britain 767. 38. In Australia there is 108% more people working part time than in the United States. 867. 12. 97. 102. In Switzerland there is 96% more people working part time than in the United States Australia 75. 5 6 In Australia the proportion of women working part time is 12% less than in Britain. 82. 17. 12. 92. 15. 100. 28. 27. 68. 112. Sweden 35 5 6 7 In Ireland the drinking level is about the same as in Denmark. 3 4 In Australia there is 15% less people working part time than in the Netherlands. 137. In Greece the proportion of women working part time is 10% less than in Britain. 102.

Venezuela 60% more (Continued) 3 . Denmark 220. Solutions – road traffic deaths Question 1 2 Answer Mauritius is most dangerous and France is the least Belgium is 220% more dangerous. Germany 164. Mauritius 800% more. Berlin 4% less. Germany 126.0. Finland 147. Germany 117. Sydney 16% less.0.1.3% more. Paris 7% less.0. Finland 223. Berlin 6% less. Latvia 66.7% more dangerous.0 Portugal 100. Russia 33.506 Statistics for Business 9. Dominican Republic 680% more. Russia 300% more. Solutions – corruption Question 1 2 3 4 5 6 7 Answer Iceland is the least corrupt. Sydney 46% more.0. Tokyo 11% less. Venezuela 380% more Belgium is 6. Slovakia. New York 12% less.9 Italy 100.0.0. Sydney 15% less. France 73.3% less. Sydney 16% less. New York 14% less. United Kingdom 172. New York 14% less. Paris 9% less.7. Sydney 18% less. United Kingdom 132. Luxembourg 13.0 Greece 100.7.3% more. Finland 137. Denmark 146. Paris 9% less. Luxembourg 220% more. Tokyo 14% less. Vancouver 18% less Amsterdam 14% more.0. Denmark 190. Vancouver 16% less Amsterdam 10% more. Vancouver 10% less Amsterdam 23% more. Berlin 19% more. Latvia 400% more. United Kingdom 200.2. Paris 73% more. New York 16% less. Solutions – (Continued) Question 2 3 4 5 6 7 Answer Amsterdam 4% more. Berlin 8% less.7.1. United Kingdom 122. Greece. Berlin 6% less. New York 165% more. Germany 190. the countries of Czech Republic. Tokyo 13% less. Vancouver 16% less Amsterdam 12% more.7% more. Dominican Republic 160% more. Vancouver 15% less Most expensive is Tokyo which is 425% more than the least expensive or over 5 times more expensive 10.2. Tokyo 212% more. France 20% less. Denmark 135. and Namibia are equally the most 78% Spain 100. Tokyo 13% less.3 Of the selected eight countries from the European Union those in North appear to be less corrupt than in the South 11. Finland 192.9. Paris 11% less.3. Mauritius 200% more.

2004 100.0 100.3 The unweighted price index measures inflation.0.1 86. Russia 53.3. Dominican Republic 85. 2005 116. Solutions – family food consumption Question 1 2 3 4 5 6 Answer 2003 2003 2003 2003 2003 100. 2004 133. 2004 100.2 109. 2002 86. 2001 100.0. 2003 74. The unweighted quantity index measures consumption changes for this family. Mauritius 246.8% more. the denominator changes by almost proportionality the same amount 13.5. Latvia 92. Improved road conditions.4. 2005 95.2 109.2% more. 2003 86. 2001 71. better driving education. France 69.4.0 122.5 141.2% less.0.8% less dangerous. better traffic control. 2004 101. Russia 4. 2002 100. 2005 100.8% less. Solutions – (Continued) Question 4 5 Answer Belgium is 23. 2003 70. The other indexes take into account price and quantity.4.3. 2005 138. The values are close as although the numerator changes.9.8.3. 2002 87. This explains a reasonable trend of the indexes .1% more dangerous.2.7% more. Solutions – meat Question 1 2 3 4 5 Answer 2000 2000 2000 2000 100. Mauritius 114.Appendix IV: Answers to end of chapter exercises 507 11. 2003 116. France 81% less.0.0 The meat prices show a reasonable trend during the period and there is an increasing trend in the quantity of meat handled.3% more. 2004 100.4.9.4.3.3% more Belgium is 23.8. Luxembourg 30. 2001 89.6% more Emerging countries are more dangerous to drive than developed countries.6.0.8% more.4 109. 2004 112. 2004 110. 2004 100. 2002 70.3% more. Luxembourg 29% less.1. Dominican Republic 200% more. Latvia 19% more. Venezuela 84. 2001 87. 2004 83.0.6. and severer penalties when traffic laws are infringed 6 12. Venezuela 14.1.

6 100.0 117. 2003 102.0 100.6 There is neither a trend in the commodity price or the usage of the metals during the 6-year period.1.3. 2005 80. 2004 90.1. 2002 78. 2005 101. 2003 121.1. 2004 98. 2001 84.0. 2004 63.9.9. 2004 103. 2004 129. 2002 72. 2001 85.1. 2005 108. 2003 95. 2001 87.0.508 Statistics for Business 14.1.5. 2001 64.8.5. 2003 96.5.0.0.9. 2002 84. 2002 81. 2003 75. 2004 84.1.5. This helps to explain the fluctuation of the indexes .4.0.4.8.6. 2003 81. 2001 92. 2001 102. 2003 71. Solutions – non-ferrous metals Question 1 2 3 4 5 Answer 2000 2000 2000 2000 100.9.6. 2005 87. Solutions – beverages Question 1 2 3 4 5 Answer 2000 2000 2000 2000 100.0 142.9.4.1.2. 2002 104. 2002 75. 2001 116.5.5 100. 2004 123. This explains the fluctuation of the indexes 15.5.5. 2002 100.1 The commodity prices show no trend during the period but there is an increasing trend of the consumption of the commodity. 2002 56.1.0 100. 2004 90.2.0. 2005 146. 2003 52. 2001 82. 2005 82.9.8.6.0. 2005 118. 2005 114.

http://epp. Stats Means Business: A Guide to Business Statistics. ISBN 0-961-39214-2. Prentice Hall. Economist (The). and numerous sources of statistical information. http://fedstats. M. T. Connecticut. Prentice Hall. A. Buglear.S.gov.. The Visual Display of Quantitative Information. www.L. London. Statistics produced by more than 70 agencies of the United States Federal Government.com Fortune. ISBN 0-13-606716-6.unctad.. Against the Gods: The Remarkable Story of Risk. London. R. Graphics Press. Daily newspaper published Monday through Friday on a distinguishing salmon coloured paper. Basic Business Statistics. Football Annual 2005–2006.iht.com Levin. 6th edition.I. Learning. D. International Herald Tribune. 2nd edition. (1987). Alreck.M. New York. Thomson. (1998).europa.com Eurostat. London. edited in Paris and Hong Kong and printed in Paris. USA. Levine.uk United Nations Conference on Trade and Development. www.org Tufte. First published in 1843. ISBN 0-7506-5364-7. E. 10th edition. ISBN 0-47112104-5. (1998). Organization for Economic and Development. Butterworth-Heinemann. ISBN 1-84480-128-4.L. D. Operations Management: A Supply Chain Approach.C. economist. Wiley. www. (2003). oecd. Weekly publication. P. Pearson.ec. Europe. Published monthly in the Netherlands by Time Warner. (Ed. D.com Waller. ISBN 0-13-196869-6. Statistical information for Europe and elsewhere. Oxford. New York. Daily newspaper published Monday through Saturday.gov Wall Street Journal. USA. Printed in Belgium. Invincible Press. (2003).wsj. and Krehbiel.) (2005). www.eu Financial Times.Bibliography There are many books related to statistics. 7th edition. Published by the New York Times. (200