
SIM MODULE BOOK

BUSINESS STATISTICS

Module Book
BUSINESS STATISTICS

Module Book Developer : Tan Suat Pheng

Production : SIM Global Education

Module Book © SIM Global Education 2020

All rights reserved.


No part of this material may be reproduced in any form or by any means without permission
in writing from SIM Global Education

First Version @ January 2020

Table of Contents

PAGE

INTRODUCTION 4

Session 1: Introduction to Statistics 10

Session 2: Organising & Presenting Data 26

Session 3: Numerical Summary of Data 41

Session 4: Use of Excel for Data Analysis 1 55

Session 5: Probability 74

Session 6: Use of Excel for Data Analysis 2 94

Session 7: Linear Regression and Correlation 100

Session 8: Normal Distributions and Sampling Distributions 118

Session 9: Estimation 132

Session 10: One-Sample Hypothesis Testing 144

Session 11: Analysis of Categorical Data: Chi-square Test of Independence 156

Session 12: Comparing Means: Analysis of Variance (ANOVA) 165

Mock Examination Practice Paper (with answers) 178

Answers to discussion and supplementary questions 187

APPENDICES
- Appendix 1: List of Formulas 203
- Appendix 2: Standard Normal Table 206
- Appendix 3: t-distribution Table 207
- Appendix 4: Chi-square distribution Table 208
- Appendix 5: F-distribution Table 209

INTRODUCTION

Content

This module develops an understanding of statistical concepts, techniques and methods at a
basic level and an awareness of their applications in the business environment. This module
gives an overview of the descriptive and inferential statistics that may be used by managers.
Topics covered include numerical measures, sampling methodologies, basic concepts of
probability and hypothesis testing, analysis of variance, correlation and regression, and
chi-square applications. Students will be taught to apply the statistical techniques to support
sound decision-making.

Module Aims

The aims of this module are to:

1. Develop an understanding of descriptive and inferential statistical methods at an
intermediate level.

2. Develop an awareness of quantitative concepts and statistical applications in business and
management.

Learning Outcomes

On completion of this module, a participant will typically be able to:

1. Show a detailed knowledge and understanding of:

(i) Descriptive statistics and characteristics of data sets.
(ii) Basic problems in probability.
(iii) Association, correlation and regression models.
(iv) Sampling, estimation and the construction of confidence intervals.
(v) Hypothesis testing, chi-square and analysis of variance and their applications.

2. Demonstrate module-specific skills with respect to:

(i) Organizing and portraying statistical data using tables and graphical techniques to
convey practical meanings.
(ii) Calculating probabilities of data sets.
(iii) Establishing associations between variables, so as to perform correlations and
estimations.

3. Show cognitive skills with respect to:

(i) Obtaining essential knowledge and techniques on methods for selecting samples from
population, as well as making statistical inferences about the population.
(ii) Applying statistical techniques to quantify information, analyze data, interpret results
and make sound decisions.
(iii) Utilising Microsoft Excel to analyse and solve statistical problems.

4. Demonstrate transferable skills in:

(i) Analytical reasoning.
(ii) Numeracy and basic data analysis.
(iii) Communication.
(iv) Statistics in context.
(v) Problem formulation and decision making.

SESSION TOPIC LEARNING OUTCOMES TEXT
1. Introduction to Statistics
Topic:
• Definition of Statistics
• Descriptive versus Inferential Statistics
• Population vs. Sample
• Types of Variables
  - Qualitative vs. Quantitative Variables
  - Discrete vs. Continuous Variables
• Level of Measurements
• Sampling Methods & Biasness Involved
  - Reasons for Sampling
  - Sampling and Data Collection Methods
  - Bias in Statistics
Learning outcomes: Students should be able to:
- Understand what is meant by Statistics
- Describe elements comprising the decision-making process
- Understand descriptive and inferential statistics
- Differentiate between population and sample
- Distinguish the types of variables being studied & their levels of measurement
- Identify the various methods of data collection
- Briefly describe various sampling methods
- Identify various ways that biasness could occur
Text: Business Statistics Module Book, Session 1

2. Organising & Presenting Data
Topic:
• Common methods of organizing & presenting data
• Stem-and-leaf Diagram
• Frequency Distribution
• Relative Frequency Distribution
• Cumulative Frequency Distribution
• Graphical Presentation of Quantitative Variables
• Graphical Presentation of Qualitative Variables
• Use of Excel
Learning outcomes: Students should be able to:
- Convert raw information into grouped data format
- Construct frequency, relative frequency, and cumulative frequency distribution
- Display raw data in a stem-and-leaf diagram
- Present quantitative data with appropriate graphical display, namely the histogram, frequency polygon
- Present qualitative data with appropriate graphical display, namely the bar chart and the pie chart
Text: Business Statistics Module Book, Session 2

3. Numerical Summary of Data
Topic:
• Measures of Central Tendency: Ungrouped Data
• Measures of Central Tendency: Grouped Data
• Measures of Dispersion: Ungrouped Data
• Measures of Dispersion: Grouped Data
• Empirical Rule
• Use of Excel
Learning outcomes: Students should be able to:
- Compute various measures of central tendency, namely the mean, median and mode
- Compute various measures of dispersion, namely the range, variance, standard deviation
- Use most appropriate numerical measures to describe data sets
- Portray the data distribution according to these numerical measures
- Understand application of the Empirical Rule
Text: Business Statistics Module Book, Session 3

4. Use of EXCEL for Data Analysis 1
Learning outcomes: Students should be able to:
- Understand the basics of Excel
- Input data using Excel
- Generate frequency tables/distributions and contingency tables
- Generate pie-charts, bar-charts, histograms and scatter diagrams
- Generate descriptive statistics – mean, median, mode, range, variance and standard deviation
Text: Business Statistics Module Book, Session 4

5. Probability
Topic:
• The Language of Probability
• Probability Rules
• Addition Rule
• Multiplication Rule
• Conditional Probability
• Bayes Theorem
• Discrete Probability Distribution
Learning outcomes: Students should be able to:
- Define basic terms used in probability, namely experiment, outcome/event and sample space.
- Understand mutually exclusive and independent events
- Draw Venn Diagrams for computing probabilities
- Understand and apply basic addition and multiplication rules and special addition and multiplication rules in computing probabilities
- Understand and compute conditional probabilities
- Apply Bayes’ theorem and draw Tree Diagrams
- Compute the mean, variance and standard deviation of a Discrete Probability Distribution.
Text: Business Statistics Module Book, Session 5

6. Use of EXCEL for Data Analysis 2
Learning outcomes: Students should be able to:
- Perform data analysis using Excel
Text: Business Statistics Module Book, Session 6

7. Linear Regression and Correlation
Topic:
• Relationship between Two Quantitative Variables: Correlation and Regression Analysis
• Analysing Associations with EXCEL
• Limitations of regression analysis
Learning outcomes: Students should be able to:
- Construct and interpret scatter plots of bivariate quantitative variables
- Identify types of relationships between two quantitative variables
- Fit a regression equation using least squares method
- Interpret the slope and y-intercept in the regression equation
- Calculate and interpret the correlation coefficient
- Calculate and interpret the coefficient of determination
- Understand the limitations of linear regression
Text: Business Statistics Module Book, Session 7

8. Normal Distributions & Sampling Distributions
Topic:
• Normal Distributions
• Sampling Distribution of a Sample Mean
• Central Limit Theorem
Learning outcomes: Students should be able to:
- Compute the areas / probabilities for a normally distributed variable
- Understand the sampling distribution of sample means and its applications
- Understand Central Limit Theorem
Text: Business Statistics Module Book, Session 8

9. Estimation
Topic:
• Types of Point Estimates
• Confidence Interval for a Population Mean
• Confidence Interval for a Population Proportion
• Factors Influencing Confidence Interval Width
• Sample Size Determination
Learning outcomes: Students should be able to:
- Explain the difference between a point estimate and an interval estimate
- Use normal distribution to construct a confidence interval for population mean and proportion
- Use t distribution to construct a confidence interval for population mean
- Decide whether normal or t distribution should be used in constructing confidence interval for population mean
- Understand the factors influencing width of Confidence Interval
- Determine a sample size at specified levels of confidence and margin of error
Text: Business Statistics Module Book, Session 9

10. One-Sample Hypothesis Testing
Topic:
• Composing Hypothesis
• Decisions about a Population Mean
• Decisions about a Population Proportion
• The Connection between Confidence Intervals and Hypothesis Testing
Learning outcomes: Students should be able to:
- Transform problems into appropriate null and alternative hypotheses
- Carry out a hypothesis test on a population parameter (mean & proportion)
- Understand level of significance & Type I and Type II errors
- Understand how confidence interval relates to hypothesis testing.
Text: Business Statistics Module Book, Session 10

11. Analysis of Categorical Data: Chi-Square Test of Independence
Topic:
• Contingency Table Analysis
• Exploring relationship between two qualitative (categorical) variables
Learning outcomes: Students should be able to:
- Organize categorical data into a contingency table
- Set up appropriate null and alternative hypotheses
- Compute expected frequencies, degrees of freedom from a given contingency table.
- Apply chi-square distribution to perform a test of association
- Understand precautions about use of chi-square
Text: Business Statistics Module Book, Session 11

12. Comparing Means: Analysis of Variance (ANOVA)
Topic:
• F-distribution
• One-way Analysis of Variance – Comparing Means among Independent Samples
• EXCEL Applications
Learning outcomes: Students should be able to:
- Understand the general approach to analysis of variance techniques
- Describe the type of application that analysis of variance is used for
- Differentiate ‘between-sample’ and ‘within-sample’ variation
- Read and apply the F-distribution table and degrees of freedom associated with F-Distribution
Text: Business Statistics Module Book, Session 12

Teaching and Learning Methods
Participants will learn through a combination of lectures and practical activities. Participants
will be expected to learn independently by carrying out reading and directed study beyond
that available within taught classes.

Indicative Readings
Recommended Text Tan Suat Pheng, Business Statistics Module Book, SIM Global
Education, 2020

Assessment/coursework
All assessments must comply with the SIM Rules and Regulations. To satisfy requirements,
students must:
1) Complete their assignments satisfactorily and submit them by the due dates. A penalty of
20% of the total marks will be imposed for late submission. A submission made more than
1 calendar day past the deadline will receive a zero mark.
2) Complete all assignments and the final examination in a satisfactory manner.
3) Reference all their work and observe SIM’s policy on plagiarism. Students found guilty of
plagiarism will be dealt with severely.
4) Adopt either the Harvard or APA (American Psychological Association) referencing
style.
5) Spend at least 100 hours (including class attendance and assignments) on the module in
order to fare reasonably well.

Specific for this module are the following requirements:


Weighting between components A and B - A: 60% B: 40%

Element Description                              % of Assessment
Component A (Controlled Conditions)
  Examination (120 minutes)                      60%
Component B (CA: Continuous Assessment)
  1. CA1
  2. CA2                                         40% (CA1 and CA2 together)
Total (Component A+B)                            100%

Calculators
Only non-programmable calculators (including non-programmable scientific) are permitted in
examinations. Listed below are some models that students can use:

Casio
FX82MS FX85MS FX95MS FX82ES PLUS

Sharp
EL509WS EL506W EL-570ES Plus

BUSINESS STATISTICS

SESSION 1

INTRODUCTION TO STATISTICS

At the end of the session, students should be able to:

1. present a broad overview of the subject of statistics and its importance.
2. explain basic terms commonly used in statistics.
3. distinguish between descriptive and inferential statistics.
4. classify variables as qualitative or quantitative, and discrete or continuous.
5. distinguish between nominal, ordinal, interval and ratio levels of measurement.
6. explain the reasons for sampling and describe various methods of sampling.
7. evaluate why bias occurs in surveys.
__________________________________________________________________________

1. Introduction

1.1 What is Statistics?

In common usage, many would refer to Statistics as numerical facts, for example the average
starting salary of graduates or the average number of cars sold in a month. Some others refer to
it as a way of collecting and displaying large amounts of numerical information. And to still
another group it is a way of “making decisions in the face of uncertainty.” Each of these points
of view is correct.

Every day, we make decisions that may be personal or business related. Many a time, the
situation or problems that we face in the real world have no precise or definite solutions. It is
from this perspective of informed and more effective decision making that we consider why we
need to know about statistics. Data are collected everywhere and require statistical knowledge
to make the information useful.

If we refer to statistics as a field or discipline of study, then we may define it as follows:

STATISTICS is the science of collecting, organising, presenting, analysing
and interpreting numerical data to assist in making more effective decisions.

1.2 Basic Terms in Statistics

To study statistics, we need to be able to speak its language. We will define some basic terms
commonly used in statistics and in research.

Population
The entire set or collection of people or objects of interest. Example: the entire collection of
students in a university. The population that is being studied is also known as the target
population.

Sample
A subset or portion of the population. Example: 10% of the students in the university were
surveyed. A sample that is chosen to represent the characteristics of the population as closely
as possible is known as a representative sample.

Parameter
A numerical measure that is computed to describe a characteristic of an entire population.
Example: The average age of all students who were admitted to ABC university this year was
21 years. The value 21 is a parameter.

Statistic
A numerical measure that is computed to describe a characteristic from a sample. Example:
The average height of 25 randomly selected female students was 1.6 metres. The value 1.6 is
a sample statistic.

Census
A survey that includes every member in the population. Example: 100% of households in
Singapore are surveyed once every 10 years.
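
The distinction between a parameter and a statistic can be illustrated with a short Python sketch. The population below is simulated (the ages, the seed and the 10% sampling fraction are assumptions made purely for illustration): the parameter is computed from every member, as in a census, while the statistic is computed from a sample only.

```python
import random
from statistics import mean

# Hypothetical population: ages of 1,000 students (simulated values, not real data).
rng = random.Random(2020)
population = [rng.randint(18, 30) for _ in range(1000)]

# Parameter: a numerical measure computed from the entire population (a census).
population_mean = mean(population)

# Statistic: the same measure computed from a sample, here 10% of the population.
sample = rng.sample(population, k=100)
sample_mean = mean(sample)

print(f"Parameter (population mean age): {population_mean:.2f}")
print(f"Statistic (sample mean age):     {sample_mean:.2f}")
```

The two values will usually be close but not identical, which is why a statistic is treated as an estimate of the corresponding parameter.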

Example 1.1
Indicate which of the following refers to a population and which refers to a sample:

Description                                               Population or Sample
(a) The average monthly commissions earned by all         Population
    property agents of PropNet Ltd.
(b) A survey of 100 employees from various                Sample
    departments of Syme Darby Trading suggested
    that staff are in favour of job rotation.
(c) Arif tested 10 paint cans to determine the paint      Sample
    quality in this current shipment.
(d) Professor Ho evaluated the mathematics                Population
    examination results for all the 20 students in
    his class.

2. Types of Statistics

Broadly speaking, statistics can be divided into two areas: Descriptive statistics and Inferential
statistics:

Figure 1.1 Types of Statistics

2.1 Descriptive Statistics

When data are first collected, they are known as raw data. Raw data sets can be very large. This
makes it difficult to draw conclusions or make decisions with data in their original form. To have
a better understanding of the data, we can organise or tabulate the data, construct charts or
graphs and compute some summary measures. The portion of statistics that helps us do these
tasks is known as descriptive statistics.

Descriptive statistics can be defined as those methods that involve the
collection, presentation and characterisation of a set of data in order to describe
the various features of that data set meaningfully.
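
As a minimal sketch of what "computing summary measures" can look like in practice, the Python snippet below summarises a small raw data set using the standard `statistics` module. The monthly car sales figures are hypothetical and serve only to illustrate the idea.

```python
from statistics import mean, median, stdev

# Hypothetical raw data: number of cars sold in each month of one year.
cars_sold = [12, 15, 11, 18, 20, 15, 14, 17, 15, 13, 19, 16]

# Descriptive statistics: a handful of numbers that summarise the raw data.
summary = {
    "count": len(cars_sold),
    "mean": mean(cars_sold),
    "median": median(cars_sold),
    "minimum": min(cars_sold),
    "maximum": max(cars_sold),
    "sample standard deviation": stdev(cars_sold),
}

for measure, value in summary.items():
    print(f"{measure}: {value:.2f}")
```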

2.2 Inferential Statistics

A major portion of statistics deals with drawing conclusions, predictions and generalisations
about the population based on results obtained from samples. For example, we draw
conclusions about the satisfaction of all customers of a restaurant by surveying, say, 100
customers. A quality control manager may inspect randomly selected products from a batch of
production to make a decision about the quality of products from that production run.

Inferential statistics can be defined as those methods that make possible the
estimation of characteristics of a population based only on sample results.

Example 1.2
Which branch of statistics does each of the following belong to?

Description                                               Descriptive or Inferential
(a) 80% of students from Class L01B passed the            Descriptive
    mathematics examination.
(b) A sample survey of 1000 voters showed that            Inferential
    55% will vote for KLM party. This will be used
    to predict voters’ behaviour on election day.
(c) A quality controller found that 3% of sample          Inferential
    items produced by Factory B were defective. He
    used this to estimate the proportion of defective
    items produced at Factory B.
(d) The number of cars sold this month is an              Descriptive
    increase of 25% when compared to the previous
    month.

3. Types of Variables

A variable refers to a characteristic of interest of each individual element of a population or
sample.

There are two basic types of variables, namely Qualitative and Quantitative. (See Figure 1.2)

Figure 1.2 Types of Variables

3.1 Qualitative variables

Qualitative variables yield non-numeric or categorical responses. Examples of qualitative
variables are gender, colour of car and favourite movie star.

A variable that does not assume a numeric value but can be classified into two or
more non-numeric categories is known as a qualitative or categorical
variable. Data collected from a qualitative variable are known as qualitative
or categorical data.

3.2 Quantitative variables

Quantitative variables yield numeric responses. Examples of quantitative variables are weight,
height and number of siblings.
A variable that can be measured numerically is known as a quantitative
variable. Data collected from a quantitative variable are known as quantitative
data.
Quantitative variables can be subdivided into two classifications: Discrete and Continuous.
Although there could be exceptions, the only distinction that we will make here is that a discrete
variable arises from counting while a continuous variable arises from measuring.

A variable that has values which are countable is known as a discrete
variable.
You count the number of siblings you have or the number of cars owned by a family. Hence,
these are discrete variables.

A variable that can assume any value over a specified range is known as a
continuous variable.
You measure height, weight, amount spent on books and travelling time. Hence, these are
continuous variables.

There may be some variables which appear numeric but should be classified as qualitative
variables. Examples are mobile phone number and car registration number. These are merely
identification numbers. They do not measure anything and you will not be able to do
mathematical computations on these variables. Hence, they are not quantitative variables.

Example 1.3
Classify the following variables as Quantitative (state Discrete or Continuous) or Qualitative:

Variable                                                  Variable Type
(a) Amount spent on clothing last month.                  Quantitative, continuous
(b) Favourite shopping mall.                              Qualitative
(c) Time taken to serve a bank customer.                  Quantitative, continuous
(d) Number of subjects taken by a student this            Quantitative, discrete
    semester.

4. Scales of Measurement

Data can be classified according to scales of measurement, also known as levels of
measurement. There are four scales of measurement (see Figure 1.3): nominal, ordinal,
interval and ratio. An awareness of the different scales of measurement will help to determine
the appropriate methods of statistical analyses that can be performed.

Figure 1.3 Scales of Measurement

4.1 Nominal Scale

Data that are qualitative or categorical have a nominal scale of measurement. Numbers are for
identification purposes and have no mathematical meaning. Examples are Colour of a car and
Preferred brand for a product.

The nominal level applies to data that are categorised and these categories
are used for identification purpose only.

We can neither rank the categories nor do any mathematical operations (such as addition,
subtraction, multiplication or division). For example, for the variable Gender, we can assign
codes 1=Male and 2=Female. The codes have no mathematical meaning as we cannot say code
1 is superior to code 2.

4.2 Ordinal level

Data that have some order or can be ranked have an ordinal level of measurement.

The ordinal level applies to data that are categorised and these categories can
be ranked.

In a survey, people were asked to rate the service at a restaurant as excellent, good or poor.
These categories can be ranked. Excellent has the highest rank
and poor has the lowest rank. So, we have 1=Excellent, 2=Good and 3=Poor. Hence, we do
know that code 1 is superior to code 2 and code 3. An important characteristic of
an ordinal scale is that we cannot measure the magnitude of the difference between the ratings.
We do not know if the difference between “excellent” and “good” is the same as the difference
between “good” and “poor”.

4.3 Interval level

Data that are numeric and for which the difference between two values is meaningful are said
to have an interval level.

Data with an interval scale contain a zero point, but it does not mean absence of the attribute.
Examples of variables with an interval scale are temperature, intelligence quotient (IQ) and shoe
size. For temperature, for example, a zero value does not represent absence of warmth. In
fact, by our own measurement, it is cold! A zero IQ does not mean a person has no intelligence.
There is also no natural zero for size (shoe size, dress size etc.)

The interval level applies to data that can be ranked and the differences
between the two values can be calculated and interpreted.

The difference between 2 values for an interval scale variable can be interpreted. For example,
in an IQ test the difference between someone who scores 120 and someone who scores 100 is
20. This difference is the same as the difference between a score of 90 and 70. However, a
characteristic of data with an interval scale is that ratios do not make sense for such data. A
person who scores 120 in an IQ test is not twice as intelligent as a person who scores 60. Neither
is a temperature of 40° Celsius twice as warm as 20° Celsius. The foot length of a person who
wears shoe size of 4 is not half that of someone who wears shoe size 8.
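
One way to see why ratios are not meaningful on an interval scale is to change the unit of measurement. The sketch below, written in Python for illustration, converts Celsius readings to Fahrenheit with the standard conversion formula; the specific temperatures are chosen only as an example. The ratio of two readings changes with the unit, while equal differences remain equal.

```python
def celsius_to_fahrenheit(celsius: float) -> float:
    """Standard conversion: F = C x 9/5 + 32."""
    return celsius * 9 / 5 + 32

warm_c, cool_c = 40.0, 20.0
warm_f = celsius_to_fahrenheit(warm_c)   # 104.0
cool_f = celsius_to_fahrenheit(cool_c)   # 68.0

# The "ratio" depends on the arbitrary zero point of the scale.
ratio_c = warm_c / cool_c                # 2.0 in Celsius
ratio_f = warm_f / cool_f                # about 1.53 in Fahrenheit

# Differences are meaningful: equal gaps in Celsius stay equal in Fahrenheit.
gap_1_f = celsius_to_fahrenheit(40.0) - celsius_to_fahrenheit(20.0)  # 36.0
gap_2_f = celsius_to_fahrenheit(30.0) - celsius_to_fahrenheit(10.0)  # 36.0

print(f"Ratio in Celsius:    {ratio_c:.2f}")
print(f"Ratio in Fahrenheit: {ratio_f:.2f}")
print(f"Equal differences preserved: {gap_1_f == gap_2_f}")
```

Because the same two temperatures give a ratio of 2 in one unit and about 1.53 in another, the statement "twice as warm" has no fixed meaning.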

4.4 Ratio Scale

The ratio level is the “highest” scale of measurement. Almost all quantitative variables are
recorded on the ratio scale.

Ratio scale applies to data with known units of measurement and all
arithmetic operations (addition, subtraction, multiplication and division) can
be done with meaningful interpretation.

Examples of variables with ratio scale of measurement are income, sales and weight. A zero
point has a meaning in ratio scale data. If you have zero dollars, it means you have no money.
A salesperson with zero sales means he did not sell any product.

Example 1.4

Variable                                                  Scale of Measurement
(a) Number of bedrooms in an apartment.                   Ratio
(b) Favourite car colour.                                 Nominal
(c) Time taken to complete an assignment.                 Ratio
(d) Rating of hotel service (1=Excellent to 5=Poor)       Ordinal
(e) Highest education level (1=Completed Primary          Ordinal
    education, 2=Completed Secondary education
    and 3=Completed Tertiary education)
(f) Today’s temperature in Melbourne.                     Interval

5. Sampling Methods

As mentioned in Section 1.2, a sample is a subset of a population that is selected for analysis.
Rather than taking a complete census of the whole population, statistical sampling procedures
focus on a small representative group of a larger population.

5.1 Reasons for Sampling

The reasons for sampling are as follows:

(1) Time consuming to study the entire population.
(2) Cost of studying the population may be prohibitive.
(3) The destructive nature of testing.
(4) Difficulty of accessing the entire population.
(5) Adequacy of sample results for statistical inference.

5.2 Sampling Methods

Sampling methods can be classified under Probability and Non-Probability sampling methods.
(see Figure 1.4)

Probability Sampling
In Probability Sampling methods, the researcher selects random members from a population by
setting a few selection criteria. These selection criteria allow units to have a known chance
(not necessarily equal) of being selected.

Non-Probability Sampling
Non-probability sampling methods are reliant on a researcher’s ability to select members.
Hence, not every unit or person has a chance of being included in the sample.

Figure 1.4 Sampling Methods

The sampling methods are described below:

Convenience Sampling
This method is dependent on the ease of accessibility to units that you wish to survey. For
example, surveying passers-by in a busy street on their opinion of a new policy to be
implemented by the government.

Judgement Sampling
A sample is selected at the discretion of the researcher, based on personal judgement about which
group of people possess the qualities that the researcher expects of the target population.

For example, in day-to-day business problems or public-policy creation,
judgement sampling may be the only practical method when action must be taken
immediately on the basis of estimates that are readily available to the businessmen and public
officials.

Simple Random Sampling
Each person or unit is chosen entirely by chance and each member of the population has an
equal chance of being included in the sample. One technique is to use a random number
generator or a table of random numbers.

In simple random sampling, each unit or person in the population has an
equal chance of being included in the sample.

Table 1.1 shows lists of random numbers. How do we use the table?

Assume you have a population of 50 persons and would like to select 5 persons at random.

Table 1.1 Random Number Table

(1) Assign a number to each person from 01 to 50

(2) Since the population size is a two-digit number, we will use the first two digits of the
numbers listed in the table.

(3) Start at any value in the table. Assume we land on 08 (see Table 1.1).

(4) The second number will be 47, the third is 02 and so on. If a number is not within the
range of 01 to 50, discard it. Continue until you find 5 numbers whose first two
digits fall within the range of 01 to 50.

(5) From this table, we arrive at 08, 47, 02, 11, and 38.

(6) Result: Persons 08, 47, 02, 11, and 38 will be used for our random sample.
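
The steps above can be sketched in Python. The table entries below are hypothetical stand-ins, chosen only so that the first two digits reproduce the sequence 08, 47, 02, 11 and 38 with some out-of-range values in between; they are not the actual contents of Table 1.1. Skipping a number that has already been chosen is an added assumption so that no person is picked twice. The last lines show the software alternative, a random number generator.

```python
import random

def pick_from_random_table(entries, population_size, sample_size):
    """Read the leading digits of each table entry, discard values outside
    1..population_size, and stop once sample_size persons are chosen."""
    width = len(str(population_size))       # 50 persons -> use the first two digits
    chosen = []
    for entry in entries:
        candidate = int(entry[:width])
        # Repeated numbers are skipped (an assumption, to avoid picking a person twice).
        if 1 <= candidate <= population_size and candidate not in chosen:
            chosen.append(candidate)
        if len(chosen) == sample_size:
            break
    return chosen

# Hypothetical table entries; 76 and 93 fall outside 01 to 50 and are discarded.
table_entries = ["08423", "47911", "02017", "76530", "11255", "93102", "38640"]
selected = pick_from_random_table(table_entries, population_size=50, sample_size=5)
print(selected)  # [8, 47, 2, 11, 38]

# Alternative: let a random number generator draw 5 distinct persons from 1 to 50.
rng = random.Random(1)
software_sample = rng.sample(range(1, 51), k=5)
print(software_sample)
```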

Systematic Sampling
In systematic sampling, units of a sample are chosen at pre-defined fixed intervals. Some
examples are:

- A professor selects every 10th person from the list of names in the student register to
attend a seminar. It requires selection of a starting point for the sample.
- An auditor selects every 5th purchase order in a file for checking.
- A researcher decides to interview every 20th person leaving a home exhibition.

In systematic sampling, every i-th person or unit in the population is selected
to be included in the sample.
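
A minimal Python sketch of systematic sampling follows. The student register is hypothetical, and the function name `systematic_sample` is introduced here only for illustration. The starting point is either supplied or chosen at random within the first interval, after which every i-th unit is taken.

```python
import random

def systematic_sample(units, interval, start=None, rng=random):
    """Select every `interval`-th unit, beginning at position `start` (0-based).
    If no start is given, one is chosen at random within the first interval."""
    if start is None:
        start = rng.randrange(interval)
    return units[start::interval]

# Hypothetical register of 100 students.
register = [f"Student {n:03d}" for n in range(1, 101)]

# Start at the 10th name (index 9) and take every 10th person after that.
every_tenth = systematic_sample(register, interval=10, start=9)
print(every_tenth)
```

With a register of 100 names and an interval of 10, the sample always contains 10 students, whichever starting point is used.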

Stratified Sampling
This is a sampling method where the population is divided into small groups (strata) that do not
overlap but represent the entire population. Units are then selected from each group
proportionately to form the sample. For example, a credit card company may create strata based
on income level – “less than $40,000”, “$40,000 to $60,000” etc. to study the type and level of
spending of credit card holders. Marketers can analyse which income groups to target to
formulate appropriate marketing strategies.

In stratified sampling, the population is divided into subgroups known as
strata. A fixed number of units are then selected from each stratum.

Assume a car distributor had 1000 customers last year spread out among various age groups. It
wishes to select a stratified sample of 100 customers for a survey. First, the distributor will
find the percentage representation of each stratum in the population. Using these percentages,
a proportionate number will be sampled from each stratum.

Age group (years)   No. of customers   % of Population   Number sampled
Below 30            100                10%               10% x 100 = 10
30 to 39            300                30%               30% x 100 = 30
40 to 49            400                40%               40% x 100 = 40
50 and above        200                20%               20% x 100 = 20
Total               1000               100%              100
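
The proportionate allocation in the table can be reproduced with a short Python sketch. The customer counts come from the table; the customer ID numbers and the random seed are hypothetical and used only to show how units might then be drawn within each stratum.

```python
import random

# Stratum sizes taken from the table above.
customers_by_age_group = {
    "Below 30": 100,
    "30 to 39": 300,
    "40 to 49": 400,
    "50 and above": 200,
}
sample_size = 100
population_size = sum(customers_by_age_group.values())   # 1000

# Proportionate allocation: each stratum's share of the population times the sample size.
# Integer division is exact here; in general the shares may need rounding.
allocation = {
    group: count * sample_size // population_size
    for group, count in customers_by_age_group.items()
}
print(allocation)  # {'Below 30': 10, '30 to 39': 30, '40 to 49': 40, '50 and above': 20}

# Draw a simple random sample within each stratum (customer IDs are hypothetical).
rng = random.Random(7)
stratified_sample = {}
next_id = 1
for group, count in customers_by_age_group.items():
    customer_ids = list(range(next_id, next_id + count))
    stratified_sample[group] = rng.sample(customer_ids, k=allocation[group])
    next_id += count
```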

Cluster Sampling
Cluster sampling is a method where the researcher decides on some criteria to divide the entire
population into groups or clusters that represent the population. A sample will be selected from
one or two of these clusters. Some or all members in the selected clusters will be surveyed.

In cluster sampling, the population is divided into subgroups known as
clusters. Persons or units are then selected from some but not all the clusters.

Assume the housing board wishes to do a study of the expenditure patterns of HDB households
in Singapore. The population is already divided based on HDB estates (assuming there are 20
such estates or clusters).
H1 H5 H9 H13 H17
H2 H6 H10 H14 H18
H3 H7 H11 H15 H19
H4 H8 H12 H16 H20

The clusters are similar to one another and each cluster represents the population well. The housing
board then selects any 2 clusters, say H1 and H5. Households in these 2 clusters are then
interviewed. Households in all other clusters are excluded from the survey.

If the number of households in the two clusters is too large, simple random sampling can then
be carried out within each cluster. This is known as multi-stage sampling.
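
The two stages can be sketched in Python as follows. The 20 estates follow the example above, but the number of households per estate (500), the second-stage sample size (50) and the random seed are assumptions made only for illustration.

```python
import random

rng = random.Random(42)

# Hypothetical population: 20 estates (clusters) with 500 households each.
clusters = {
    f"H{i}": [f"H{i}-household-{j}" for j in range(1, 501)]
    for i in range(1, 21)
}

# Stage 1: select 2 of the 20 clusters at random.
chosen_estates = rng.sample(sorted(clusters), k=2)

# Stage 2 (multi-stage sampling): simple random sample of 50 households
# within each chosen cluster, instead of surveying every household.
households_to_survey = {
    estate: rng.sample(clusters[estate], k=50)
    for estate in chosen_estates
}

print(chosen_estates)
print({estate: len(picked) for estate, picked in households_to_survey.items()})
```

Households in the 18 estates that were not chosen at stage 1 never enter the sample, which is the key difference from stratified sampling, where every stratum contributes units.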

Example 1.5
Identify the sampling method used in the following situations:

Description                                               Sampling Method
(a) The database of a large hospital contains records     Systematic Sampling
    of 10,000 patients. The records are sequentially
    numbered from 1 to 10,000. A sample of 100
    patients was obtained by choosing patients
    numbered 100, 200, 300, ……, 10,000.
(b) A wholesale food distributor would like to test       Stratified Sampling
    the demand for a new food product. He
    distributes food through five large supermarket
    chains. The food distributor selects a sample of
    stores from each chain and tests his new product
    in these stores.
(c) Interviewers station themselves near office           Convenience Sampling
    buildings, MRT stations and bus-stops to
    interview people who pass by about a pending
    increase in transport fares.
(d) A private university has 5 groups (A,B,C,D,E)         Cluster Sampling
    of 10 students each pursuing a Masters in
    Psychology programme. Only Group B was
    selected and the students interviewed about their
    satisfaction with the programme.
(e) To determine what class to put students into at a     Simple random sampling
    school, names are entered into a software
    program, which then randomly assigns students
    in each class.

6. Survey Methods

Data collected may be primary data or secondary data. Primary data are collected firsthand by
the researcher, while secondary data have already been collected by someone else and are
available to the public through publications, journals and newspapers.

The common methods for primary data collection are:


- Face-to-face interviews (including door-to-door surveys, street intercepts and onsite
surveys)
- Online/mail questionnaire (self-completion)
- Telephone survey (similar to face-to-face but without personal interaction)
- Direct observation e.g. through focus groups

We explore the advantages and disadvantages of some of these methods.

Face-to-face interviews

Advantages: Can gather in-depth attitudes; allows probing and detailed responses.
Disadvantages: Relatively expensive and time consuming; may require a quiet area to conduct.

Mail questionnaire

Advantages: Allow time for people to answer questions, minimal staff requirements, able to
cover large geographical area
Disadvantages: Low response rate, questions may be misunderstood, require pre-test to
minimize bias.

Focus groups

Advantages: Larger group of participants at one time, group dynamics generate ideas.
Disadvantages: Difficulty of scheduling, require strong facilitator, may need special
equipment to record.

7. Bias

Bias is said to occur when the sample results are systematically different from the truth about
the population.

7.1 Selection Bias

This occurs when there is a tendency to include or exclude certain persons or units in the
sample. In other words, the sample selected does not accurately reflect the target population.
Examples:
- Using an online survey to research the importance of smart technology in our lives may
exclude elderly people.
- A call-in radio show that solicits audience participation on controversial topics such as
gun control or setting up casinos tends to over-represent individuals who have strong
opinions.

7.2 Response Bias

Responses given are inaccurate for reasons such as ambiguous question wording,
sensitivity of information, leading questions or lack of interviewer training. Here are some
examples of poor question wording:
“Do you shop regularly?” (the word “regularly” is ambiguous)
“Has any family member been treated for behaviour disorder?” (sensitivity of information)
“Should online purchases be delivered on time as part of customer service?” (leading question)

7.3 Non-Response Bias

This is the tendency for certain types of persons or units not to respond to the study.
Non-responders may share similar characteristics. In a mail survey, for example, the upper and
lower social classes tend not to respond, so the viewpoints of the middle class are
over-represented.

8. Discussion questions

1. Explain whether the following variables are quantitative (state discrete or continuous)
or qualitative.
(a) The built-in area of a HDB 5-room flat.
(b) The colour of Kevin’s new sports car.
(c) The number of applications received by a university.
(d) The “hotline” telephone number of ABC Bank.

2. The following database contains some information of selected employees in a company.
Identify and explain the level of measurement for each variable.

Employee   Staff    Gender   Performance   Salary      Age     Years of
Name       code              rating                    Group   experience
Salimah    18215    Female   3             $2,423.65   2       14
Soon An    24466    Male     2             $2,942.55   1       6
Chandra    07543    Male     2             $3,557.93   3       23

Performance rating: 1 = excellent, 5 = poor
Age Group: 1 = Below 25, 2 = 25 to 50, 3 = Above 50

3. Indicate which of the following examples refer to a population and which refer to a
sample.
(a) An auditor selected 30 employee leave records from staff working at ABC Bank
for checking.
(b) Results of all students who sat for the examination were evaluated.

4. Indicate whether the statement refers to descriptive or inferential statistics.


(a) The lowest bid received for motorcycles during the last COE bidding was $1900.
(b) Medical researchers state that the risk of eye retina problems increases by 20%
when a person is over 60.

5. For each of the following statements, indicate whether the highlighted value is a
parameter or a statistic:
(a) The average annual advertising expenditure was $20,000 obtained from a survey
of 50 retail stores at The Jewel, Changi Airport.
(b) The total amount of investment in financial products at all the branches of City
Bank was $288 million in 20X9.

6. Explain the MAIN type of bias that is evident in the situations below:
(a) A supermarket researching on expenditure of customers sent a questionnaire to
all its loyalty cardholders.
(b) A hotel requested its guests to drop by its admin office to do a survey about their
stay in the hotel.
(c) A researcher selected a group of senior bankers to study the cost of living in
Singapore.
(d) John, a grassroots member was asked to knock on every door of a HDB block
to get an idea about building a children’s playground in the vicinity.
(e) A student wrote the following question in a questionnaire. “Don’t you think the
very wealthy people should donate more to charity?”

7. Identify the type of sampling method used:
(a) At a birthday party, 50 children were each assigned a number from 01 to 50.
Five numbers were chosen at random to receive a prize.
(b) An auditor selected every 10th purchase order from a file for checking.
(c) In a study about credit card usage, customers were grouped into five groups
based on their spending level for the last six months. 10 customers were then
selected from each of these five groups.

9. Supplementary questions

1. State whether the statement refers to descriptive or inferential statistics.

(a) The consumption of health supplements is expected to increase by 10% next
year.
(b) 20 persons aged 50 to 60 were surveyed. It was found that they took an average
of 9 types of health supplements each day.
(c) This year’s graduate employment survey showed that the average monthly salary
of IT graduates is $4,200.
(d) The demand for university places is forecasted to increase 2-fold within the next
five years.

2. For each of the variables listed, indicate whether it is a quantitative (state discrete or
continuous) or qualitative variable.

    Variable                                                         Variable Type
(a) Renovation costs for Samuel’s new apartment.
(b) Number of audit jobs completed by Team A this month.
(c) Best-selling car model this year.
(d) Number of new car models launched by distributors this year.

3. Identify the scale of measurement for each variable.

    Variable                                                         Scale of Measurement
(a) Number of pet dogs owned by Ah Tim
(b) Consultation time with Dr Huan
(c) Chelsea’s dress size
(d) Rating (scale of 1 to 10) of a car model by a car magazine.
(e) Sugar content (in grams) of a soft drink
(f) Most popular music artiste this year

4. For the statements stated below, identify which refers to a population and which refers
to a sample.
(a) 20 bottles of wine in a production process were selected for a taste test.
(b) A nurse took the temperature and blood pressure of ALL patients at St Luke
Eldercare Centre.

5. Provide examples of situations where the following sampling methods were used:
(a) Simple random sampling
(b) Systematic sampling
(c) Stratified sampling
(d) Cluster sampling

6. A garment factory has 200 workers. The average time taken to complete a particular
procedure by these 200 workers was 10.9 minutes. A sample of 20 workers was then
taken. The average time taken by these 20 workers was found to be 10.4 minutes.
(a) Which values represent parameters?
(b) Which values represent statistics?

7. A marketer wants to obtain feedback for the design of a new product packaging. Five
designs are being considered and respondents were asked to rank their preferences for
these designs (5= Most Preferred and 1= Least Preferred). The sample of respondents
was obtained by interviewing every 20th shopper who walks into a particular store.

(a) Are the data values obtained discrete or continuous?


(b) Identify the level of measurement for the response value.
(c) What sampling method was being used?

BUSINESS STATISTICS

SESSION 2

ORGANISING & PRESENTING DATA

At the end of the session, students should be able to:

1. develop frequency tables and charts for qualitative variables.
2. develop frequency distributions and charts for quantitative variables.
3. develop charts for displaying raw numeric data.
4. develop charts for bivariate quantitative variables.
5. develop contingency tables for bivariate qualitative (categorical) variables.
6. read information from the tables and charts generated.
_________________________________________________________________

1. Introduction

When data are recorded in the sequence that they are collected, they are known as raw data.
Such data are random and unranked.

Table 2.1 and Table 2.2 show examples of qualitative raw data and quantitative raw data
respectively.

Finance     Business    Economics   Finance      Economics
Business    Others      Others      Business     Others
Finance     Others      Finance     Humanities   Humanities
Economics   Business    Finance     Business     Business
Business    Others      Economics   Finance      Economics
Table 2.1 Major of students

64 42 83 24 12 15
67 51 77 57 81 19
62 46 35 27 69 41
64 25 48 64 72 48
50 34 75 38 51 26
Table 2.2 Transactions ($) at a cafe

When data sets are small, it is relatively easy to observe differences among the raw values
or ungrouped data. However, with moderate to large data sets, the pattern of variability
becomes less apparent. Hence, it is better to tabulate the data into more readable formats.

2. Organising and Graphing Qualitative Data

2.1 Frequency Tables

A frequency table exhibits how the frequencies are distributed over various categories. From
the data in Table 2.1, the variable is major. The number of students belonging to the various
majors is called the frequency of that category.

Example 2.1
Set up a frequency table for the data in Table 2.1.

Solution:
The completed table is shown below:
Type of Major Number of Students
Business 7
Economics 5
Finance 6
Humanities 2
Others 5
Total 25
Table 2.3 Frequency Table : Type of Major
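
As a rough check on the manual tally, the same frequency table can be produced in Python. This is a minimal sketch using the standard library's `Counter`; the list `majors` simply re-types the 25 raw values from Table 2.1.

```python
from collections import Counter

# Raw data from Table 2.1 (major of 25 students), row by row
majors = [
    "Finance", "Business", "Economics", "Finance", "Economics",
    "Business", "Others", "Others", "Business", "Others",
    "Finance", "Others", "Finance", "Humanities", "Humanities",
    "Economics", "Business", "Finance", "Business", "Business",
    "Business", "Others", "Economics", "Finance", "Economics",
]

# Count how many students fall into each category
frequency = Counter(majors)

for major in sorted(frequency):
    print(f"{major:<12}{frequency[major]:>3}")
print(f"{'Total':<12}{sum(frequency.values()):>3}")
```

The counts agree with Table 2.3: Business 7, Economics 5, Finance 6, Humanities 2 and Others 5, for a total of 25.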

2.2 Bar Charts

A bar chart (also called a bar graph) can be used to display qualitative data. We mark the various
categories on the horizontal axis and frequency counts on the vertical axis. Usually we leave a
gap between the categories.

Example 2.2
From the frequency table obtained in Example 2.1, construct a bar graph.

Solution:
[Bar chart: vertical axis "Number of students" (0 to 8); horizontal axis categories
Business, Economics, Finance, Humanities, Others]

Figure 2.1 Bar Chart : Type of Major

Sometimes, in a bar chart, the categories are marked on the vertical axis and the frequencies on
the horizontal axis. This is known as a horizontal bar chart.

2.3 Pie Charts

A pie chart is more commonly displayed in percentages, although it can be used to display
frequencies. The pie is divided into different portions that represent the percentages for the
different categories. These percentages are known as relative frequencies.

$$\text{Relative frequency} = \frac{\text{Frequency of that category}}{\text{Sum of all frequencies}} \times 100$$

Example 2.3
Create a pie chart based on the information from Table 2.3.
Type of Major Number of Students Relative frequency (%)
Business 7 28.0
Economics 5 20.0
Finance 6 24.0
Humanities 2 8.0
Others 5 20.0
Total 25 100.0
Table 2.4 Relative Frequency - Type of Major
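
The relative frequency column can be reproduced with a few lines of Python. This is a small sketch that starts from the counts in Table 2.3; the variable names are illustrative.

```python
# Frequencies from Table 2.3
frequency = {
    "Business": 7,
    "Economics": 5,
    "Finance": 6,
    "Humanities": 2,
    "Others": 5,
}

total = sum(frequency.values())  # sum of all frequencies

# Relative frequency (%) = frequency of that category / sum of all frequencies x 100
relative = {major: round(count / total * 100, 1) for major, count in frequency.items()}

for major, pct in relative.items():
    print(f"{major:<12}{pct:>6.1f}%")
```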

Solution:
The completed pie chart is shown in Figure 2.2.

[Pie chart with slices: Business 28%, Economics 20%, Finance 24%, Humanities 8%, Others 20%]

Figure 2.2 Pie Chart : Type of Major

3. Organising and Graphing Quantitative Data

3.1 Frequency Distributions

A frequency distribution shows a listing of the variable into groups of values known as classes
and the number of values (frequencies) falling into each class. Note that the classes always
represent the variable. The classes are non-overlapping; that is, each value belongs to one and
only one class. The frequency distribution is sometimes presented together with the relative
frequency, cumulative frequency or cumulative relative frequency.

Example 2.4
The following data show the values (in $) of 30 transactions at a local cafe.
64 42 83 24 12 15
67 51 77 57 81 19
62 46 35 27 69 41
64 25 48 64 72 48
50 34 75 38 51 26

Set up a frequency distribution.

Solution:
Here are the guidelines to set up a frequency distribution:

Step 1: Decide on the number of classes, k

We choose the smallest k such that $2^k > n$, where n represents the number of observations.
In practice, the number of classes can be a subjective choice.

Sample size n = 30: $2^4 = 16 < 30$ but $2^5 = 32 > 30$ => Use 5 classes
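
The rule in Step 1 can be sketched as a short Python function. The function name `number_of_classes` is an illustrative choice, not part of the module.

```python
def number_of_classes(n):
    """Return the smallest k such that 2**k > n."""
    k = 1
    while 2 ** k <= n:
        k += 1
    return k

print(number_of_classes(30))  # 5, because 2**5 = 32 > 30
```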

Step 2: Determine the class interval (i)

An equal class interval (i), also called the class width (w), is preferred. A class interval is
defined as the difference between the lower limits of two consecutive classes. To compute the
class interval, we can use

$$i \ge \frac{H - L}{k}$$
where H = Highest value in the dataset
L = Lowest value in the dataset
k = Number of classes (obtained in step 1)

The highest value is 83 and the lowest value is 12 for this data set. So, we have

$$i \ge \frac{83 - 12}{5} = 14.2$$

We round up to a whole number, therefore we decide on a class interval, i = 15.
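
The same calculation as a minimal Python sketch, using the values from this example (H = 83, L = 12, k = 5):

```python
import math

highest, lowest, k = 83, 12, 5

# Minimum class interval: (H - L) / k
minimum_interval = (highest - lowest) / k

# Round up to the next whole number
interval = math.ceil(minimum_interval)

print(minimum_interval, interval)
```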

Step 3: Set up the individual classes and count (tally) the number of observations for each
class.

The completed table is presented below together with additional information, namely the
relative frequency and cumulative frequency columns. The relative frequency shows the
proportion (percentage) of observations falling within each class. The cumulative
frequency is a running total of the frequency counts. (See Table 2.5)

Class          Frequency   Cumulative Frequency   Relative Frequency (%)
10 up to 25        4        4                      4/30 x 100 = 13.3%
25 up to 40        6        4 + 6 = 10             6/30 x 100 = 20.0%
40 up to 55        8        10 + 8 = 18            8/30 x 100 = 26.7%
55 up to 70        7        18 + 7 = 25            7/30 x 100 = 23.3%
70 up to 85        5        25 + 5 = 30            5/30 x 100 = 16.7%
Total             30                               100.0%
Table 2.5 Frequency Distribution - Transactions ($) at cafe

How do we do a tally (or frequency count)?

Note that when we have a class say, “10 up to 25”, it would include all values from 10 to less
than 25. In other words, a transaction value of $25 would be included in the class “25 up to
40”. Similarly, a transaction value of $40 would be included in the class “40 up to 55” and so
on.
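
The tallying rule can be sketched in Python as follows. The sketch assumes the choices made in this example (first lower limit 10, class interval 15, 5 classes); integer division places a boundary value such as 25 in the class that starts at 25.

```python
# Transaction values ($) from Example 2.4
data = [
    64, 42, 83, 24, 12, 15,
    67, 51, 77, 57, 81, 19,
    62, 46, 35, 27, 69, 41,
    64, 25, 48, 64, 72, 48,
    50, 34, 75, 38, 51, 26,
]

lower, width, k = 10, 15, 5  # first lower limit, class interval, number of classes

# Tally: class index 0 is "10 up to 25", index 1 is "25 up to 40", and so on
frequency = [0] * k
for value in data:
    frequency[(value - lower) // width] += 1

# Running total and percentage share for each class
cumulative = []
running = 0
for count in frequency:
    running += count
    cumulative.append(running)
relative = [round(count / len(data) * 100, 1) for count in frequency]

for i in range(k):
    start = lower + i * width
    print(f"{start} up to {start + width}: {frequency[i]:>2} {cumulative[i]:>3} {relative[i]:>5.1f}%")
```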

For large data sets, manual tallying naturally becomes tedious. Hence, the use of software
packages such as SPSS or Excel greatly assists in setting up frequency distributions or tables.

3.2 Histogram

A histogram is a chart that can be drawn for a frequency distribution or relative frequency
distribution. We mark classes on the horizontal axis and frequencies (or relative frequencies)
on the vertical axis. The columns are drawn adjacent to each other without leaving any gap
between them.

Example 2.5
Using the information from the frequency distribution in Table 2.5, construct a histogram.

Solution:
The completed histogram is shown in Figure 2.3.

Figure 2.3 Histogram : Transaction value ($) at a café

Note that we may also use class mid-points instead of class limits to label the horizontal axis.
A class mid-point is calculated by summing the lower and upper limit of a class and then
dividing by 2. Thus,

$$\text{Class mid-point} = \frac{\text{Lower class limit} + \text{Upper class limit}}{2}$$

3.3 Frequency Polygon

A frequency polygon is a graphical display with lines connecting the intersection points of the
class mid-point and frequencies (or relative frequencies).

Example 2.6
The table below shows a frequency distribution of the ages (years) of all 50 employees of a
company. Compute the class mid-points. Thereafter, construct a frequency polygon.

Class Frequency Mid-point
20 up to 32 12 (20 + 32)/2 = 26
32 up to 44 17 (32 + 44)/2 = 38
44 up to 56 14 (44 + 56)/2 = 50
56 up to 68 7 (56 + 68)/2 = 62
50
Table 2.6 Computing Class mid-points
Solution:
The completed frequency polygon is shown in Figure 2.4

[Frequency polygon: horizontal axis "Age (years)" marked at the mid-points 14, 26, 38, 50, 62
and 74; vertical axis "No of Employees" from 0 to 18]

Figure 2.4 Frequency Polygon: Age of Employees

Note in Figure 2.4 that, to complete the frequency polygon, mid-points of 14 and 74 are added
to the X-axis to “anchor” the polygon at zero frequencies. These two values, 14 and 74, were
derived by subtracting the class interval of 12 years from the lowest mid-point (26 years) and
by adding 12 years to the highest mid-point (62 years) in the frequency distribution.
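
The mid-points and the two anchor points can be computed with a short Python sketch based on Table 2.6. The list `polygon_points` is an illustrative name for the (x, y) pairs that would be joined by lines.

```python
# Classes (lower limit, upper limit) and frequencies from Table 2.6
classes = [(20, 32), (32, 44), (44, 56), (56, 68)]
frequencies = [12, 17, 14, 7]

# Class mid-point = (lower class limit + upper class limit) / 2
midpoints = [(low + high) / 2 for low, high in classes]

# Anchor points: one class interval below the lowest mid-point
# and one class interval above the highest mid-point
width = classes[0][1] - classes[0][0]
anchors = (midpoints[0] - width, midpoints[-1] + width)

# Points of the frequency polygon, anchored at zero frequency on both ends
polygon_points = [(anchors[0], 0)] + list(zip(midpoints, frequencies)) + [(anchors[1], 0)]
print(polygon_points)
```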

Both the histogram and the frequency polygon allow us to get a quick picture of the main
characteristics of the data (highs, lows and concentration of data etc.).

3.4 Cumulative Frequency Polygon

When plotted on a diagram, the cumulative frequencies give a curve that is called an ogive
(pronounced o-jive).

Example 2.7
Using the data from Table 2.6, compute the cumulative frequencies. Thereafter, construct a
cumulative frequency polygon.

Solution:
Class Frequency Cumulative frequency
20 up to 32 12 12
32 up to 44 17 12 + 17 = 29
44 up to 56 14 29 + 14 = 43
56 up to 68 7 43 + 7 =50
50
Table 2.7 Computing Cumulative Frequencies
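
The running totals in Table 2.7 can be checked with `itertools.accumulate` from the Python standard library. A minimal sketch:

```python
from itertools import accumulate

# Frequencies from Table 2.6
frequencies = [12, 17, 14, 7]

# Running total of the frequency counts
cumulative = list(accumulate(frequencies))
print(cumulative)
```

The last cumulative frequency always equals the total number of observations, here the 50 employees.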

To draw the ogive, the variable age is marked on the horizontal axis using the class limits
and the cumulative frequencies on the vertical axis. A dot is marked at zero above the lower
limit of the first class, and each cumulative frequency is then marked above the upper limit
of its class. (See Figure 2.5)

Figure 2.5 Cumulative Frequency Polygon : Age of Employees

Note that the cumulative frequency polygon starts at the lower limit of the first class and ends
at the upper limit of the last class.

3.5 Stem and Leaf Diagram

Another technique to present quantitative data is the stem and leaf diagram. An advantage of
the stem and leaf diagram is that we are able to view the data distribution featuring the actual
numerical values of the raw data.

The data are arranged in ascending order. Each numerical value is an observation divided into
two parts. The leading digit(s) is the stem and the remaining digit(s) forms the leaf. The
arrangement of leaves on the stems provides a pictorial representation of the distribution.

Example 2.8
The following data show the dividend yield (in percent) of 12 blue chip stocks. Create a
stem-and-leaf diagram for this set of data.
4.5 3.7 4.4 3.8 7.7 3.8 3.5 3.4 3.0 4.6 3.9 2.3

Solution:
The first digit would form the stem and the second digit the leaf.

The stem-and-leaf diagram would appear as follows:


Stem Leaf
2 3
3 0 4 5 7 8 8 9
4 4 5 6
5
6
7 7
Stem unit = 1 Leaf unit = 0.1
Figure 2.6 Stem and Leaf Diagram (Dividend Yield)
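
A stem and leaf diagram like Figure 2.6 can be built with a short Python sketch. Working in tenths (multiplying by 10 and rounding) is a choice made here to avoid floating-point surprises; it is not part of the manual method.

```python
from collections import defaultdict

# Dividend yields (%) from Example 2.8
yields = [4.5, 3.7, 4.4, 3.8, 7.7, 3.8, 3.5, 3.4, 3.0, 4.6, 3.9, 2.3]

stems = defaultdict(list)
for value in sorted(yields):                      # ascending order
    stem, leaf = divmod(round(value * 10), 10)    # 4.5 -> stem 4, leaf 5
    stems[stem].append(leaf)

# Print every stem from the smallest to the largest, including empty ones
for stem in range(min(stems), max(stems) + 1):
    leaves = " ".join(str(leaf) for leaf in stems.get(stem, []))
    print(f"{stem} | {leaves}")
```

Stems 5 and 6 are printed with no leaves, which makes the gap between the bulk of the data and the value 7.7 visible.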

Sometimes, we may want to construct a stem and leaf diagram for three- and four-digit
numbers.

Example 2.9
The following data give the monthly rents ($) paid by a sample of 30 households from a certain
city. Create a stem and leaf diagram.

429 540 550 578 585 620 650 660 675 732
750 750 765 780 800 820 840 870 871 880
900 930 950 956 975 989 1020 1020 1030 1070

Solution:
The stem and leaf diagram would appear as follows:
Stem Leaf
4 29
5 40 50 78 85
6 20 50 60 75
7 32 50 50 65 80
8 00 20 40 70 71 80
9 00 30 50 56 75 89
10 20 20 30 70
Stem unit = 100 Leaf unit = 1

Figure 2.7 Stem and Leaf Diagram (Monthly Rent)

4. Presenting Two Variables

Sometimes we may have bivariate data, that is, data for two variables where we want to
compare and find relationships. For 2 quantitative variables, a scatter diagram can be drawn.
If we have 2 qualitative (categorical) variables, we can summarise the data using a contingency
table.

4.1 Scatter Diagram

The scatter diagram is the simplest method to study the relationship between two variables.
Each pair of values is plotted as a dot. The degree to which the variables are related to
each other is shown by the manner in which the dots are scattered over the chart. The more
widely the dots are scattered over the chart, the weaker the correlation between the
variables.

Examples of scatter diagrams are shown in Figure 2.8. We can see that sales level and
advertising expenditure are linked. Sales level is also linked to the number of employees;
however, the relationship is weaker as the points are more scattered. In both charts, an
increase in the horizontal axis variable (either advertising expenditure or number of
employees) is associated with an increase in the vertical axis variable (sales).

We will deal more with scatter diagrams in Session 7 (Linear Regression & Correlation).

[Two scatter diagrams: Sales ($) against Advertising ($), and Sales ($) against Number of
Employees]

Figure 2.8 Scatter Diagrams

4.2 Contingency Table

Contingency tables (also known as cross-tabulations or two-way tables) are used in statistics to
summarize the relationship between two categorical variables. A contingency table is a
special type of frequency table in which two variables are shown simultaneously.

Table 2.8 shows a sample of 100 persons from 3 different age groups and their preferred
activity.
                                    Age Group
Preferred activity   < 25 years   25 to 50 years   Over 50 years   Total
Brisk walking             0             14               15           29
Swimming                 20             11               10           41
Tennis                   20              5                5           30
Total                    40             30               30          100
Table 2.8 Contingency Table

Is preferred activity related to age group? You would probably be able to see some relationship
by examining the frequency counts from the table.
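
One informal way to examine the counts is to convert each age-group column into percentages, so that groups of different sizes can be compared. The Python sketch below does this for Table 2.8; it is only a descriptive aid and not a formal test of independence.

```python
age_groups = ["< 25 years", "25 to 50 years", "Over 50 years"]

# Frequency counts from Table 2.8 (one list per activity, one entry per age group)
table = {
    "Brisk walking": [0, 14, 15],
    "Swimming": [20, 11, 10],
    "Tennis": [20, 5, 5],
}

# Column totals: number of persons in each age group
col_totals = [sum(row[j] for row in table.values()) for j in range(len(age_groups))]

# Percentage of each age group preferring each activity
col_pct = {
    activity: [round(100 * count / total, 1) for count, total in zip(row, col_totals)]
    for activity, row in table.items()
}

for activity, pcts in col_pct.items():
    print(f"{activity:<14}", pcts)
```

For instance, none of the under-25 group prefers brisk walking, while half of the over-50 group does, which suggests that preferred activity and age group are related.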

We will deal more with contingency tables in Session 11 (Analysis of Categorical Data: Chi-
Square Test of Independence).

5. Discussion questions
1. D’drink Inc asks 100 randomly sampled customers to take a taste test and select the
beverage they preferred most. The results are shown in the following table:

Beverage Number
Cola-plus 40
Coca-cola 25
Pepsi 20
Lemon-lime 15
TOTAL 100

(a) Is the variable qualitative or quantitative? Why?


(b) What is the table called? What does it show?
(c) Develop a bar chart to depict the information.
(d) Develop a pie chart using the relative frequencies.

2. The time spent (minutes) in a restaurant for 30 customers is shown below:

75 54 62 79 79 53 67 60 60 105
58 51 69 65 90 98 82 93 60 93
74 77 42 84 88 69 74 73 64 114
Construct a frequency distribution. Use the value 40 as the lower limit of the first class.
Also, compute the relative frequencies.

3. The annual imports of a selected group of electronic suppliers are shown in the
following frequency distribution.

Imports ($mil) No of Suppliers


2 up to 5 6
5 up to 8 13
8 up to 11 20
11 up to 14 10
14 up to 17 1
50

(a) Draw a histogram.


(b) Draw a relative frequency polygon.

4. A sample of the hourly wages of 15 employees in a logistics company was organized


into the following table:
Hourly wages ($) Number of employees
8 up to 10 3
10 up to 12 7
12 up to 14 4
14 up to 16 1
(a) What is the table called?
(b) Develop a cumulative frequency distribution and portray the distribution in a
cumulative frequency polygon.

(c) On the basis of the cumulative frequency polygon, how many employees earn
$11 an hour or less? Half of the employees earn an hourly wage of how much
or more? Four employees earn how much or less?

5. The rates of return (%) of 21 unit trusts during a boom year are shown below:
8.3 9.6 9.5 9.1 8.8 11.2 7.7 10.1 9.9 10.8
10.2 8.0 8.4 8.1 11.6 9.6 8.8 8.0 10.4 9.8 9.2

(a) Organize this information into a stem-and-leaf display.


(b) How many rates are less than 9.0?
(c) List the rates in the 10.0 up to 11.0 category.
(d) What is the median (i.e. the middlemost value)?
(e) What are the maximum and the minimum rates of return?

6. A survey was carried out on readers of a travel magazine. Respondents were asked
the amount they spent on holidays in the previous year. The frequency distribution is
shown below:
Amount Spent ($) Frequency
3000 up to 6000 22
6000 up to 9000 53
9000 up to 12000 19
12000 up to 15000 1
15000 up to 18000 3

(a) How many readers responded to the survey?


(b) What percentage of readers spent $9,000 or more on holidays?
(c) Construct a frequency polygon and describe the shape of the distribution.

7. A megamall is considering revising its parking fees because of extremely heavy traffic
on weekends. A study was carried out to record the duration of stay (in minutes) of
500 cars entering the car park on a Saturday.

Length of stay (minutes) Number of cars


0 up to 30 40
30 up to 60 50
60 up to 90 90
90 up to 120 150
120 up to 150 60
150 up to 180 110
Total 500

(a) Calculate the relative frequencies (express answers in percentage).


(b) What percentage of cars was parked for less than one hour?
(c) Draw a histogram representing the information in the above table.

6. Supplementary questions
1. A Food Company has been serving a large range of breakfast cereals with an additional
flavoring, Koko Krunch, which has gained popularity among its customers. The
company is interested in finding out the customer preferences for Koko Krunch versus
Trix, Honey Stars, Cookie Crisp, and Snow Flakes. 360 customers were surveyed at
random by getting them to take a test and select the breakfast cereal they preferred the
most. The results are as shown below:
Breakfast Cereal Number
Koko Krunch 115
Trix 63
Honey Stars 102
Cookie Crisp 43
Snow Flakes 37
Total 360
(a) Is the data presented above qualitative or quantitative? Explain your answer.
(b) What do you call this table? What information does it present?
(c) Create a bar chart to illustrate the information above.
(d) Create a pie chart using the information above.

2. The table below shows the frequency distribution for the profit earned on furniture sold
in December 20x8 at the Orange Lights Furnishing Pte Ltd.
Profit ($) Frequency Relative
Frequency (%)
100 up to 300 18
300 up to 500 38
500 up to 700 57
700 up to 900 43
900 up to 1,100 33
1,100 up to 1,300 11
Total 200

(a) Complete the relative frequency column.


(b) How many pieces of furniture belong to the $500 up to $700 class?
(c) What is the proportion of furniture sold for a profit of between $500 up to $700?
(d) What is the proportion of furniture sold for a profit of $900 or more?

3. A company sold the following number of units of laptops in the last 14 months.

33 38 49 27 38 29 38
38 38 48 39 50 29 31

You wish to construct a frequency distribution.


(a) What number of classes would you recommend?
(b) Suggest the class interval to be used for the frequency distribution.
(c) Set up the frequency distribution using 27 as the lower limit of the first class.
4. The following data give the prices ($) for a sample of 20 books on investment tips.
32.5 24.0 37.0 67.9
43.7 49.0 62.7 15.8
19.5 59.5 31.7 54.7
29.7 44.5 22.7 27.4
53.4 17.4 47.5 43.2

(a) Construct a frequency distribution table for the above data using “10.0 up to
20” as the first class.
(b) Calculate the relative frequencies.

5. In CKS store, a customer service counter was set up to cater to enquiries and requests
from customers. The frequency distribution below shows the waiting time of customers
on a particular day.
Waiting Time (in minutes) No of customers
0 up to 5 4
5 up to 10 3
10 up to 15 6
15 up to 20 7
20 up to 25 4
25 up to 30 3

(a) How many customers made enquiries on that particular day?


(b) What is the percentage of customers who waited 15 minutes or more?
(c) Name a graphical presentation that can be used to display the above data.
Construct the graphical presentation you have stated.

6. The following frequency distribution shows the order lead time (time elapsed between
when an order is placed and when it is filled) at TwinkleBell.com, an online blogshop
retailer.

Lead Time (days) Frequency


0 up to 5 15
5 up to 10 28
10 up to 15 19
15 up to 20 10
20 up to 25 6
25 up to 30 2
Total 80

(a) State the number of orders filled in less than 15 days.


(b) State the number of orders filled in less than 20 days
(c) Using the data in the table above, construct a cumulative frequency distribution.
(d) Construct a cumulative frequency polygon.
(e) In less than how many days do they have 60 percent of their orders filled in?

7. The following are the responses of 20 students of an accounting class who were asked
to evaluate the usefulness of the course. The students were asked to choose one of five
responses: Excellent (E), Above average (AA), Average (A), Below average (B), and
Poor (P).
AA B A E AA
E AA B AA E
E A B AA E
A E P P B

(a) Construct a frequency table with relative frequencies.


(b) Explain briefly the usefulness of grouping data into a frequency table.
(c) What percentage of students rated the usefulness of the course as excellent or
above average?
(d) Draw a bar graph for the frequency table constructed in (a).
(e) Besides a bar graph, state one other method of graphical presentation that would
be suitable to present the data. Why is this method suitable?

8. Construct a scatter diagram for the following sample data. Describe the relationship
between the following X and Y values.

X 10 15 11 17 11 9 12 10 11 13
Y 5 6 7 8 4 3 6 2 2 7

9. A dessert parlour is interested in investigating the relationship between the gender of a
customer and whether the customer orders ice-cream. The table below shows the information,
collected by the Operations Manager, on 100 customers.

                          Gender
Ice-cream Ordered   Women    Men    Total
Yes                   39      19      58
No                    11      31      42
Total                 50      50     100

(a) State the level of measurement of the two variables.
(b) Name the table presented above.
(c) Do the results shown in the table suggest that women are more likely to order
ice-cream than men? Explain your answer.

10. AA Fashion is considering a merger with BB Fashion. The board of directors surveyed
100 stockholders regarding their position on the merger. The table below
presents the results of the survey.

                                  Opinion
Number of Shares Held   Favor   Oppose   Undecided   Total
Under 100                 16       9         3         28
100 up to 500             21       9         1         31
500 up to 1,000           13      11         1         25
Above 1,000                9       5         2         16
Total                     59      34         7        100

(a) State the level of measurement for the two variables used in the table above.
(b) Name the table presented above.
(c) State the group that is most in favour of the merger.

BUSINESS STATISTICS

SESSION 3

NUMERICAL SUMMARY OF DATA

At the end of the session, students should be able to:

1. compute various measures of central tendency.


2. compute various measures of dispersion.
3. use the most appropriate numerical measures to describe data sets.
4. portray data distribution according to these numerical measures.
5. apply the Empirical rule for symmetrical distributions.
_________________________________________________________________

1. Introduction

In Session 2, we discussed how to organize data and present them using various charts. These
techniques are insufficient when we need to describe the main characteristics of a data set.
Numerical summary measures that describe the centre and spread of a distribution show us
the main features of a data set. These measures are known as measures of central tendency and
measures of dispersion.

Figure 3.1 Descriptive Measures

2. Measures of Central Tendency: Ungrouped Data

Data in its original form are known as raw or ungrouped data. The summary measures that
give averages are called measures of central tendency (also known as measures of location).
We shall look at three such measures – the mean, the median and the mode.

2.1 Mean

The mean, also called the arithmetic mean, is the most frequently used measure of central
tendency. For ungrouped data, the mean is obtained by dividing the sum of all values by the
number of values in the data set. The mean calculated for sample data is denoted by 𝑥̅ (read
as x-bar) and the mean calculated for population data is denoted by µ (a Greek character read
as mu).

Population Mean : $\mu = \dfrac{\sum x}{N}$

Sample Mean : $\bar{x} = \dfrac{\sum x}{n}$

where $\sum x$ is the sum of all values of x, N is the population size, n is the sample size.

Example 3.1
The monthly salaries ($) of a sample of five employees for an IT company are as follows.
3700 4500 5300 7250 8250
Find the mean monthly salary.

Solution:
The mean will be denoted by $\bar{x}$ since the data come from a sample.

$$\bar{x} = \frac{\sum x}{n} = \frac{3700 + 4500 + 5300 + 7250 + 8250}{5} = \$5{,}800$$

Sometimes a data set may contain a very large or a very small value. Such values are called
outliers or extreme values. A major disadvantage of the mean is that it is very sensitive to
outliers. Example 3.2 illustrates this.

Example 3.2
Further to Example 3.1, suppose an additional employee was sampled making a sample size of
6. The salaries of these six employees are:
3700 4500 5300 7250 8250 25000
Compute the new mean value.

Solution:
The new mean will be
\[ \bar{x} = \frac{3700 + 4500 + 5300 + 7250 + 8250 + 25000}{6} = \frac{54000}{6} \]

= $9,000

The impact of the outlier is that it causes the mean to increase substantially from $5,800 to
$9,000. We should remember that the mean is not always the best measure of central tendency
because it is heavily influenced by outliers.
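
The effect of the outlier in Examples 3.1 and 3.2 can be checked with a short Python sketch. It uses the standard library function statistics.mean; the variable names are illustrative only.

```python
from statistics import mean

# Monthly salaries ($) from Example 3.1
salaries = [3700, 4500, 5300, 7250, 8250]

# Example 3.2 adds one very large value (an outlier)
salaries_with_outlier = salaries + [25000]

mean_before = mean(salaries)
mean_after = mean(salaries_with_outlier)

print(f"Mean of five salaries: ${mean_before:,.0f}")
print(f"Mean of six salaries:  ${mean_after:,.0f}")
print(f"Increase caused by one outlier: ${mean_after - mean_before:,.0f}")
```

A single extra value moves the mean by $3,200, which is why the mean should be used with care when outliers are present.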

2.2 Median

The median is the value of the middle term in a data set with n values after arranging the values
in ascending order.

The position of the middle item is obtained as follows:


\[ \text{Median position} = \frac{n + 1}{2} \]

Example 3.3
Given the age (years) of 5 college students : 21, 25, 19, 20, 22
Find the median.

Solution:
First, arrange the 5 values in ascending order: 19 20 21 22 25
Median position = \( \dfrac{5 + 1}{2} \) = 3rd value
Hence, the median is 21 years.

Example 3.4
Given the weight (kilograms) of 4 basketball players: 76 73 80 75
Find the median.

Solution:
First, arrange the 4 values in ascending order : 73 75 76 80
Median position = \( \dfrac{4 + 1}{2} \) = 2.5th value
This means that the median lies between the 2nd and the 3rd value. The median is found by
taking the average of the two numbers.
Hence, the median = \( \dfrac{75 + 76}{2} \)
= 75.5 kilograms
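
The median position rule can be sketched in Python as follows. The function name median_by_position is hypothetical and simply follows the (n + 1)/2 rule used in Examples 3.3 and 3.4.

```python
def median_by_position(values):
    """Return the median using the (n + 1) / 2 position rule."""
    ordered = sorted(values)              # arrange in ascending order
    n = len(ordered)
    position = (n + 1) / 2                # 1-based position of the middle term
    if position.is_integer():
        return ordered[int(position) - 1]
    # A position such as 2.5 means: average the 2nd and 3rd values
    lower = ordered[int(position) - 1]
    upper = ordered[int(position)]
    return (lower + upper) / 2

ages = [21, 25, 19, 20, 22]       # Example 3.3
weights = [76, 73, 80, 75]        # Example 3.4

print(median_by_position(ages))     # 21
print(median_by_position(weights))  # 75.5
```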

2.3 Mode

The mode represents the most frequent value in a data set.

Example 3.5
Find the modal weekly earnings for the following samples:

Sample 1: 150 178 164 144 185 196 → No mode

Sample 2: 150 185 164 144 185 196 → 185 (unimodal)

Sample 3: 144 185 163 144 185 196 → 144 and 185 (bimodal)

The main shortcoming of the mode is that a data set may not have a mode or may have
more than one mode. A dataset with one mode is said to be unimodal. If there are two modes,
it is bimodal. If it contains more than two modes then it is multimodal.
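
A minimal Python sketch of how the mode can be found for the three samples in Example 3.5 is given below. The function name modes is hypothetical, and it adopts the convention of that example: if every value occurs only once, the data set has no mode.

```python
from collections import Counter

def modes(values):
    """Return a sorted list of modes; an empty list means no mode."""
    counts = Counter(values)
    highest = max(counts.values())
    if highest == 1:
        return []          # every value occurs once: no mode
    return sorted(v for v, c in counts.items() if c == highest)

sample_1 = [150, 178, 164, 144, 185, 196]
sample_2 = [150, 185, 164, 144, 185, 196]
sample_3 = [144, 185, 163, 144, 185, 196]

for sample in (sample_1, sample_2, sample_3):
    print(modes(sample))
```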

Looking at the three measures of central tendency, we cannot conclude which is definitely a
better overall measure. Each of them may be better under different situations. Nevertheless,
the most commonly used measure is the mean followed by the median. The median is a better
measure when the data set has outliers.

3. Relationship Among the Mean, Median and Mode

Two of the many shapes of a distribution are the symmetric and the skewed distributions. In
Session 2 we learnt about histograms and frequency polygons. We shall now look at the values
of the mean, median and mode for different shapes of distribution.

3.1 Symmetric Distribution (Mean=Median=Mode)

If we have a symmetric histogram and frequency polygon with a single peak, the mean, median
and mode will be equal and they lie at the centre of the distribution. (See Figure 3.2)

Figure 3.2 Symmetric Distribution

3.2 Positively (Right) Skewed Distribution (Mean>Median>Mode)

If we have a histogram and a frequency polygon that is skewed to the right (See Figure 3.3),
the mean is the largest and the mode is the smallest. The median will lie between the mean and
the mode. The mean is the largest because it is affected by large-value outliers, which pull up
the value of the mean.

Figure 3.3 Positively Skewed Distribution

3.3 Negatively (Left) Skewed Distribution (Mean<Median<Mode)

If we have a histogram and a frequency polygon that is skewed to the left (See Figure 3.4), the
mean is the smallest and the mode is the largest. The median will again lie between the mean
and the mode. The mean is the smallest because it is affected by some small-value outliers,
which pull down the value of the mean.

Figure 3.4 Negatively Skewed Distribution

4. Measures of Dispersion: Ungrouped Data

Measures of central tendency give us an idea of the typical middle value in a dataset. However,
these measures do not reveal the full picture of the distribution. Two distributions may have
similar means but the spread (also known as dispersion or variability) of data could be
completely different.

Listed below are the test scores for two groups of students.

Class A: 55, 56, 57, 58, 59, 60, 60, 60, 61, 62, 63, 64, 65

Mean = median = mode = 60

Class B: 35, 40, 45, 50, 55, 60, 60, 60, 65, 70, 75, 80, 85

Mean = median = mode = 60

We can see that the two data sets have equal mean, median, and mode. However, the scores
for Class B are much more dispersed compared to Class A. Hence, we need some measures
that help us know about the spread of data. These measures are called Measures of Dispersion.
We shall look at just three (Range, Variance and Standard Deviation) amongst a number of
measures of dispersion.
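
The comparison of Class A and Class B can be reproduced with the sketch below. It uses Python's statistics module; the helper name summarise is hypothetical, and the spread is measured simply as the largest score minus the smallest score.

```python
from statistics import mean, median, mode

class_a = [55, 56, 57, 58, 59, 60, 60, 60, 61, 62, 63, 64, 65]
class_b = [35, 40, 45, 50, 55, 60, 60, 60, 65, 70, 75, 80, 85]

def summarise(scores):
    """Return the three measures of central tendency and a simple spread."""
    return {
        "mean": mean(scores),
        "median": median(scores),
        "mode": mode(scores),
        "spread": max(scores) - min(scores),
    }

summary_a = summarise(class_a)
summary_b = summarise(class_b)
print("Class A:", summary_a)
print("Class B:", summary_b)
```

Both classes share the same centre (60), but the spread for Class B is five times that of Class A.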

4.1 Range

The range is obtained by taking the difference between the largest and the smallest values in a
data set.

Example 3.6
The closing month-end stock price ($) for DBS Bank over the last six months are as follows:
25.20 24.90 22.70 27.80 28.30 29.00
Find the range.

Solution:
Range = Largest value – Smallest value
= 29.00 – 22.70
= $6.30

The range, like the mean, has the disadvantage of being influenced by outliers. Hence, it may
not be a good measure of dispersion for datasets with outliers.

Another disadvantage is that the range uses only two values in the dataset regardless of the
dataset size. All other values are ignored.

4.2 Variance and Standard Deviation

The variance is defined as the average of the squared deviations of the data values from the
mean. The variance calculated for population data is denoted by σ² (read as sigma squared),
and the variance calculated for sample data is denoted by s².

The formulas for calculating variance are:


Population variance: \( \sigma^2 = \dfrac{\sum (x - \mu)^2}{N} \)

Sample variance: \( s^2 = \dfrac{\sum (x - \bar{x})^2}{n - 1} \)

The variance is expressed in squared units, for example, dollars², minutes², etc.,
which are not easy to interpret. For this reason, we obtain the standard deviation by taking
the square root of the variance.

Population standard deviation: \( \sigma = \sqrt{\dfrac{\sum (x - \mu)^2}{N}} \)

Sample standard deviation: \( s = \sqrt{\dfrac{\sum (x - \bar{x})^2}{n - 1}} \)

Standard deviation is the most frequently used measure of dispersion. It tells us how closely
the values of the data set are clustered around the mean. A small value standard deviation
indicates a smaller variability of the data values around the mean. The standard deviation is
expressed in the same unit as the original variable value.

Example 3.7
Given the following data set: 6 3 8 5 3
Compute the variance and standard deviation
(a) Assuming that the data are from a population.
(b) Assuming that the data are from a sample.

Solution (a):
Assuming population data….
\[ \mu = \frac{\sum x}{N} = \frac{6 + 3 + 8 + 5 + 3}{5} = 5 \]

\[ \sigma^2 = \frac{\sum (x - \mu)^2}{N} = \frac{(6-5)^2 + (3-5)^2 + (8-5)^2 + (5-5)^2 + (3-5)^2}{5} = 3.6 \]

\[ \sigma = \sqrt{3.6} = 1.897 \]

The calculations can also be presented in tabular form:

x          µ          (x – µ)²
6          5          1
3          5          4
8          5          9
5          5          0
3          5          4
Σx = 25               Σ(x – µ)² = 18

\[ \mu = \frac{\sum x}{N} = \frac{25}{5} = 5 \]

\[ \sigma^2 = \frac{\sum (x - \mu)^2}{N} = \frac{18}{5} = 3.6 \]

\[ \sigma = \sqrt{3.6} = 1.897 \]

Solution (b):
Assuming sample data….

\[ \bar{x} = \frac{\sum x}{n} = \frac{6 + 3 + 8 + 5 + 3}{5} = 5 \]

\[ s^2 = \frac{\sum (x - \bar{x})^2}{n - 1} = \frac{(6-5)^2 + (3-5)^2 + (8-5)^2 + (5-5)^2 + (3-5)^2}{5 - 1} = 4.5 \]

\[ s = \sqrt{4.5} = 2.121 \]

There are alternative computation formulas for calculating the standard deviation:

\[ \sigma = \sqrt{\frac{\sum x^2}{N} - \mu^2} \qquad\qquad s = \sqrt{\frac{\sum x^2 - n\bar{x}^2}{n - 1}} \]
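
As a check, the sketch below computes the variance and standard deviation for the data in Example 3.7 using both the definition formulas and the alternative computation formulas. Variable names are illustrative; the functions pvariance and variance come from Python's statistics module and are used only to confirm the hand calculation.

```python
from math import sqrt
from statistics import pvariance, variance

data = [6, 3, 8, 5, 3]                 # Example 3.7
n = len(data)
mean = sum(data) / n                   # 5.0

# Definition formulas
sum_sq_dev = sum((x - mean) ** 2 for x in data)   # 18.0
pop_var = sum_sq_dev / n               # divide by N for a population
samp_var = sum_sq_dev / (n - 1)        # divide by n - 1 for a sample

# Alternative computation formulas
sum_sq = sum(x * x for x in data)      # sum of x squared = 143
pop_sd_alt = sqrt(sum_sq / n - mean ** 2)
samp_sd_alt = sqrt((sum_sq - n * mean ** 2) / (n - 1))

print(round(pop_var, 3), round(sqrt(pop_var), 3))     # 3.6 1.897
print(round(samp_var, 3), round(sqrt(samp_var), 3))   # 4.5 2.121
print(round(pop_sd_alt, 3), round(samp_sd_alt, 3))    # 1.897 2.121
```

Both routes give the same answers; the alternative formulas only rearrange the arithmetic.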

5. Mean, Variance and Standard Deviation for Grouped Data (samples only)

5.1 Mean for Grouped Data

When data are given in the form of a frequency distribution, we will not know the actual values
of the individual observations. Hence, we need to find an approximation for the mean of
grouped data. We will look at the mean for samples only.
\[ \bar{x} = \frac{\sum fM}{n} \]
where M = midpoint of a class and f = frequency of a class

Example 3.8
A company recently surveyed a sample of employees to determine how far they lived from their
corporate headquarters. The results are shown below:
Distance (kilometres) Number of Employees
0 up to 5 4
5 up to 10 15
10 up to 15 27
15 up to 20 18
20 up to 25 6
Compute the mean.

Solution:
For grouped data, we need to compute the mid-point of each class. The purpose of the mid-
point is that it is used as an estimate for all the values in that particular class.

Class Frequency (f) Mid-point (M)


0 up to 5 4 2.5
5 up to 10 15 7.5
10 up to 15 27 12.5
15 up to 20 18 17.5
20 up to 25 6 22.5
Total (n) 70

\[ \bar{x} = \frac{\sum fM}{n} = \frac{4(2.5) + 15(7.5) + 27(12.5) + 18(17.5) + 6(22.5)}{70} = \frac{910}{70} \]
= 13 kilometres

5.2 Variance and Standard Deviation for Grouped Data

The formulas used for calculating sample variance and standard deviation for grouped data are:
Sample variance \( s^2 = \dfrac{\sum f(M - \bar{x})^2}{n - 1} \)

Sample standard deviation \( s = \sqrt{\dfrac{\sum f(M - \bar{x})^2}{n - 1}} \)

Example 3.9
We refer back to the data in Example 3.8. Compute the sample standard deviation for the
grouped data.
Class Frequency (f)
0 up to 5 4
5 up to 10 15
10 up to 15 27
15 up to 20 18
20 up to 25 6
Total 70

Solution:
For grouped data, the mid-point for each class is first computed.

Class          Frequency (f)    Mid-point (M)    Mean, x̄    f(M − x̄)²
0 up to 5            4               2.5            13         441.00
5 up to 10          15               7.5            13         453.75
10 up to 15         27              12.5            13           6.75
15 up to 20         18              17.5            13         364.50
20 up to 25          6              22.5            13         541.50
Total               70                                        1807.50

\[ s = \sqrt{\frac{\sum f(M - \bar{x})^2}{n - 1}} = \sqrt{\frac{1807.50}{70 - 1}} \]
= 5.118 kilometres
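
The grouped-data calculations in Examples 3.8 and 3.9 can be sketched in Python as shown below. Each class is written as (lower limit, upper limit, frequency); the variable names are illustrative only.

```python
from math import sqrt

# (lower limit, upper limit, frequency) for each class in Example 3.8
classes = [(0, 5, 4), (5, 10, 15), (10, 15, 27), (15, 20, 18), (20, 25, 6)]

frequencies = [f for _, _, f in classes]
midpoints = [(lower + upper) / 2 for lower, upper, _ in classes]
n = sum(frequencies)                                   # 70

# Grouped mean: sum of f * M divided by n
grouped_mean = sum(f * m for f, m in zip(frequencies, midpoints)) / n

# Grouped sample standard deviation: sqrt of sum of f * (M - mean)^2 over (n - 1)
sum_sq_dev = sum(f * (m - grouped_mean) ** 2 for f, m in zip(frequencies, midpoints))
grouped_sd = sqrt(sum_sq_dev / (n - 1))

print(grouped_mean)            # 13.0
print(sum_sq_dev)              # 1807.5
print(round(grouped_sd, 3))    # 5.118
```

These are estimates, since each mid-point stands in for all the values in its class.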

6. Empirical Rule

For a bell-shaped distribution, approximately


• 68% of the observations lie within one standard deviation of the mean.
• 95% of the observations lie within two standard deviations of the mean.
• 99.7% of the observations lie within three standard deviations of the mean.

Figure 3.5 Illustration of the empirical rule

Example 3.10
A set of 200 observations has a mean (µ) of 100 and a standard deviation (σ) of 10. If the
distribution is symmetrical, approximately how many observations should be found in the
interval 80 to 120?

Solution:
The interval 80 to 120 is ± 2σ from the mean of 100, obtained as follows:
𝜇 ± 2𝜎 = 100 ± 2(10) = 80 to 120

According to empirical rule, about 95% of observations fall within 2 standard deviations from
the mean.

Hence, number of observations found in the interval 80 to 120 = 95% × 200
= 190 observations.

Example 3.11
The evaluation score of all employees in a company follows a symmetric bell-shaped curve,
with a mean of 3.8 (out of total score of 5) and variance of 0.3. Harry’s score is 4.0. The
company recognises the top 16% of performing employees as the top talent. Is Harry a top
talent?

Solution:
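Applying the empirical rule, about 68% of the scores lie within one standard deviation of the
mean. The remaining 32% is split equally between the two tails, so about 16% of the scores lie
above µ + σ.

The top 16% must achieve a score of (µ + σ) or more.

µ + σ = 3.8 + √0.3
= 4.35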

Hence, employees with evaluation score of 4.35 and above would be considered top talent.
Since Harry’s score is 4.0 which is less than 4.35, Harry is not a top talent.
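
The reasoning in Examples 3.10 and 3.11 can be sketched in Python as follows. The helper name interval is hypothetical; the percentages 68% and 95% are the approximate values given by the empirical rule.

```python
from math import sqrt

def interval(mean, sd, k):
    """Return the interval mean - k*sd to mean + k*sd."""
    return (mean - k * sd, mean + k * sd)

# Example 3.10: 200 observations, mean 100, standard deviation 10
low, high = interval(100, 10, 2)         # two standard deviations either side
expected_count = 0.95 * 200              # about 95% of observations

# Example 3.11: mean 3.8, variance 0.3, so take the square root first
mu = 3.8
sigma = sqrt(0.3)
share_above_one_sd = (1 - 0.68) / 2      # half of the 32% outside mu +/- sigma
cutoff = mu + sigma                      # score needed to be in the top 16%
harry_is_top_talent = 4.0 >= cutoff

print(low, high, round(expected_count))
print(round(share_above_one_sd, 2), round(cutoff, 2), harry_is_top_talent)
```

Note that Example 3.11 gives the variance, so the standard deviation must be obtained before the rule is applied.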

7. Discussion questions

1. All the students in an advanced mathematics course form a population. Their course
grades are:

92, 96, 61, 86, 79 and 84.

(a) Give the formula for the population mean


(b) Compute the mean course grade.
(c) Is the mean that you computed in (b) a statistic or a parameter? Why?

2. The number of work stoppages in the construction industry for selected months are:

6, 0, 10, 14, 8 and 0

(a) What is the median number of stoppages?


(b) What is the modal number of stoppages?
(c) Is the mode a suitable measure of central tendency for this data set? Explain.

3. Many regular customers of Jinny Group purchased hair packages of varying dollar
amounts. A sample of 8 customers showed the amount (in dollars) of hair packages
purchased by these customers.

200 200 300 400 450 500 550 1000

(a) Calculate the mean, median and mode.


(b) Is the mean, median or mode the preferred measure of central tendency for this
dataset? Explain.
(c) Compute the range.
(d) Besides the range, name one other measure of dispersion.

4. A company has an office in Location A that hired five audit trainees. The monthly
starting salaries were:

$3536 $3173 $3448 $3121 $3622

(a) Compute the population mean.


(b) Compute the population variance.
(c) Compute the population standard deviation.
(d) The company has another office in Location B that hired six trainees. Their
mean monthly salary was $3550 and the standard deviation was $250. Compare
the two groups.

5. The years of service for a sample of seven employees at a telco retail outlet are:

4, 2, 5, 4, 5, 2 and 6.

What is the sample variance and sample standard deviation?

6. The net profits of a sample of large importers of household products were organized
into the following table:
Net Profits ($m) Number of importers
2 up to 6 1
6 up to 10 4
10 up to 14 10
14 up to 18 3
18 up to 22 2

(a) What is the table called?


(b) Compute an estimate of the mean net profits.
(c) Compute an estimate of the standard deviation.

7. The shopping expenditure of a population of working executives has a symmetrical
distribution with a mean of $400 and a standard deviation of $25. Using the empirical
rule, state the expenditure range for
(a) 68% of the executives.
(b) 95% of the executives.
(c) 99.7% of the executives.

8. Supplementary questions

1. An engineer tested twelve pieces of a certain mechanical component produced at a
factory. The following are the number of hours it took for each component to fail when
the motor was run continuously at maximum output:
12 30 23 21 29 22
21 19 27 16 21 29
Compute the median, mean and mode.

2. The following data represent the weight in kilograms of several randomly selected
batches of dried goods arriving at a port last month.
Weight (kilograms) Number of batches
0 up to 25 5
25 up to 50 23
50 up to 75 8
75 up to 100 6
100 up to 125 4

(a) Compute the mean weight for these data.


(b) Describe the distribution’s skewness, if any.
(c) Which is the preferred measure of central tendency – the mean or the median?
Explain.

3. A set of 60 observations has a mean of 80 with a variance of 16. If the distribution is
symmetrical, approximately how many observations should be found in the interval 76
to 84? Explain using the empirical rule.

4. The following table shows the sales turnover of Dinesh Sundries Store for the month of
July.
Daily Sales ($) Number of Days
0 up to 1000 1
1000 up to 2000 2
2000 up to 3000 3
3000 up to 4000 8
4000 up to 5000 12
5000 up to 6000 5

(a) What is the variable being studied?


(b) Is the data grouped or ungrouped?
(c) Comment on the skewness of the distribution. Would you expect the mean or
the median to be higher? Explain.

5. Consider the following sample data set:


3 2 4 6 1

(a) Find the median and variance.


(b) Suppose that we add the same number to all the values in the dataset. Which
of the two quantities (median or variance) will remain the same? Explain.

6. A variable has a unimodal distribution with mean 30 and median of 45. Is the
distribution skewed to the left, to the right or symmetric? Explain and give a rough
sketch of the distribution.

7. At a stock investment seminar, a sample of 6 participants were asked to estimate the
size (in $000) of their current stock investment portfolio. The following results were
obtained:
40 20 30 150 50 25

(a) Compute the mean and median.


(b) Is the mean or median the preferred measure of central tendency for this data
set? Explain.
(c) Compute the sample standard deviation.

8. The following table shows the frequency distribution of the weights of a sample of 55
pieces of luggage being checked in on an airplane by passengers travelling in the premium
economy class.
Weight (kg) Frequency
5 up to 15 5
15 up to 25 7
25 up to 35 19
35 up to 45 17
45 up to 55 7

(a) Compute the mean.


(b) Compute the variance.
(c) Compute the standard deviation.

9. The following data shows the years of service of all the teachers in a private school.

2, 6, 4, 12, 20, 8, 12, 2

(a) Is the data set a sample or a population?


(b) Compute the mean and the standard deviation.

10. Crystal, a secondary school student took 5 quizzes last week. The scores are listed
below:
8 98 24 18 27

(a) Compute the mean.


(b) Is the mean a fair representation of central location for this dataset? Explain.

BUSINESS STATISTICS

SESSION 4

USE OF EXCEL FOR DATA ANALYSIS 1

At the end of the session, students should be able to:

1. input collected data in Excel to facilitate data analysis.


2. generate data using formulas and basic functions.
3. apply Excel functions/commands for data analysis.

_______________________________________________________________________

1. Introduction

Excel is a powerful application that offers a large number of functions, tools and options for
use. With knowledge of Excel, you can organize and manipulate data, perform computations
as well as create charts. This would allow you to conduct better data analysis and assist you in
decision making. As such, Excel has been widely used in many organisations today.

In this session, you will be introduced to several functions that would be useful for you to
analyse large data sets.

2. USING EXCEL BUILT-IN FUNCTIONS AND COMMANDS

2.1 Descriptive Statistics functions

Here we cover some of the commonly used commands for descriptive statistics, for example:

o Sum
o Average (mean)
o Median
o Mode
o Count
o Minimum
o Maximum
o Sample standard deviation
o Frequency distribution

Example 4.1
Input the following data into cells A1 to A10 which show the ages of a sample of persons in a
tour group:

Based on the raw data, we can use some useful statistical functions in Excel to perform the
required tasks.

Task                                  Syntax                              Excel Function/Command
Sum up all values                     =SUM(number1, number2, ...)         =SUM(A1:A10)
Find the sample mean value            =AVERAGE(number1, number2, ...)     =AVERAGE(A1:A10)
Find the median value                 =MEDIAN(number1, number2, ...)      =MEDIAN(A1:A10)
Find the modal value                  =MODE(number1, number2, ...)        =MODE(A1:A10)
Find the minimum value                =MIN(number1, number2, ...)         =MIN(A1:A10)
Find the maximum value                =MAX(number1, number2, ...)         =MAX(A1:A10)
Count total number of values          =COUNT(value1, value2, ...)         =COUNT(A1:A10)
Count number of values exceeding 20   =COUNTIF(range, criteria)           =COUNTIF(A1:A10,">20")
Compute sample standard deviation     =STDEV.S(number1, number2, ...)     =STDEV.S(A1:A10)
Set up frequency distribution         =FREQUENCY(data_array,bin_array)    See part 2.2
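
For readers who wish to cross-check Excel results outside Excel, the sketch below shows rough Python equivalents of the functions in the table. The ten ages are hypothetical values made up for illustration (the data for Example 4.1 is not reproduced here), and the dictionary keys are only labels.

```python
from statistics import mean, median, mode, stdev

# Hypothetical ages standing in for cells A1:A10
ages = [23, 35, 18, 42, 35, 27, 19, 51, 35, 20]

summary = {
    "SUM": sum(ages),
    "AVERAGE": mean(ages),
    "MEDIAN": median(ages),
    "MODE": mode(ages),
    "MIN": min(ages),
    "MAX": max(ages),
    "COUNT": len(ages),                             # all entries are numeric
    "COUNTIF >20": sum(1 for a in ages if a > 20),  # values exceeding 20
    "STDEV.S": stdev(ages),                         # sample standard deviation
}

for name, value in summary.items():
    print(f"{name:12} {value}")
```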

The answers are shown below:

2.2 Creating a Frequency distribution

Based on the raw data of sample ages in Example 4.1, we now wish to set up a frequency
distribution using 3 classes.

The lower and upper limits must be keyed into separate cells.

The raw data is at range A1:A10. We can now create the command to fill up the frequency
column from cells D14:D16.

Steps:
o Select the cell range D14:D16 (note: need to highlight the WHOLE range)
o The command syntax is as follows:
=FREQUENCY(data_array,bin_array)
where
data array refers to the raw data (A1:A10) and
bin array refers to the upper limits (C14:C16).
o Next, type the following command:
=frequency(A1:A10,C14:C16)
o To execute the command, press CTRL-SHIFT-ENTER
(note: if you simply hit ENTER key, an error will occur)
o You can sum up the frequency column using =SUM(D14:D16) and displaying the
answer at Cell D17.
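
The counting rule behind FREQUENCY can be sketched in Python as follows. Each value is counted in the first class whose upper limit is greater than or equal to it, and Excel's function also returns one extra count for values above the last upper limit. The function name frequency, the ages and the upper limits below are all hypothetical, and the upper limits are assumed to be in ascending order.

```python
def frequency(data, bins):
    """Count values per class, where bins holds ascending upper limits."""
    counts = [0] * (len(bins) + 1)     # last slot: values above the final limit
    for value in data:
        for i, upper in enumerate(bins):
            if value <= upper:         # first class whose upper limit fits
                counts[i] += 1
                break
        else:
            counts[-1] += 1            # larger than every upper limit
    return counts

ages = [23, 35, 18, 42, 35, 27, 19, 51, 35, 20]   # hypothetical raw data
upper_limits = [29, 39, 59]                       # hypothetical 3 classes

class_counts = frequency(ages, upper_limits)
print(class_counts)          # [5, 3, 2, 0]
print(sum(class_counts))     # total frequency, 10
```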

The final output is shown below:

2.3. Histogram

Example 4.2
Create a histogram based on the frequency distribution below:
Hours spent studying No. of Respondents (frequency)
0 up to 10 3
10 up to 20 2
20 up to 30 5

To obtain a histogram, select the range of frequency counts and then select Insert -> Chart ->
Column. A column chart will be generated.

Double-click the bars and select Format Data Series to reduce the Gap Width to 0%. Add a
suitable title and axis labels to the histogram using the Chart Design/Tools function.

2.4 Pivot Tables and Charts

The Pivot Table feature of Excel is useful and versatile. This tool makes it possible to
summarise your raw data into more informative tabulations.

Example 4.3
The raw data show the favourite colour provided by 15 children. Input the values to Excel.

To create a PIVOT table, select the entire range (K1:K16) including the header.
Then select Insert, Pivot Table.

In the resulting Pivot Table framework, drag the “colour” item to the Row Labels area and then
drag the “colour” item into the Values area:

Next, right-click in the Values area and change the “Field setting” to Count, since we are
doing a count of the number of children in each category.

The completed frequency table appears as follows :

To obtain a pie chart, simply highlight the range A4:B7 then select Insert Chart/Pie.
The final pie chart is shown below:

[Figure: pie chart of favourite colour, with segments Blue, Green, Red and Yellow]

You may add a title and data labels to the chart using Chart Design/Tools and adjust them as needed.

Example 4.4
The pivot table command can also be used for two variables to form a cross-tabulation.
Suppose we now have the data for 2 variables – Gender and favourite Colour.

We want to do a cross tabulation of Gender (row variable) versus Colour (column variable).
The steps are :
o Select the entire range J1:K16 including the headers.
o Select Insert, Pivot Table
o Drag “GENDER” item to the Row Labels area.
o Drag “COLOUR” item to the Column Labels area.
o Drag either “GENDER” or “COLOUR” to the Values Area
o For the Values Area, change field setting to Count if required.

The final cross tabulation (also known as a contingency table) is shown below:

2.5 VLOOKUP (Vertical look up) Function

VLookup is an Excel function to look up and retrieve data from a specific column in a table.

Example 4.5
In this example, we wish to provide a comment (Good, Average and Unsatisfactory) depending
on the scores obtained by a group of students as shown in the table below:

We shall now use the vlookup function to insert the comments for the various scores obtained
by 10 persons.

In this worksheet:
o The look-up value is the score of 90 found in cell A2.
o The table to look up is found in the range D2:G4.
o The value to be displayed will be the value at Column 4 of the table.

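The syntax for this function is: =VLOOKUP(lookup_value, table_array, col_index_num)
The optional fourth argument, range_lookup, is omitted here, so Excel performs an approximate
match. This requires the first column of the table to be sorted in ascending order.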

Our command would be =VLOOKUP(A2, D2:G4, 4). The final answers are shown below:

EXCEL LAB EXERCISE 1

1. The sample of 10 observations shows the monthly wages of part-time workers:


752 576 583 691 480 752 814 681 580 373
Find the :
(a) mean
(b) median
(c) mode
(d) minimum
(e) maximum
(f) sample standard deviation

2. The 3 most popular types of burgers purchased by 15 customers over the last one hour
are as follows:
Fish Chicken Vegetable Vegetable Chicken
Chicken Fish Chicken Chicken Chicken
Fish Chicken Vegetable Fish Fish

(a) Set up a frequency table. [use Pivot Table function]


(b) Based on the frequency table from Part (a), create a pie chart and a bar chart.

3. The following data shows the gender and the brand of cars purchased by a sample of
20 new car buyers.
GENDER BRAND GENDER BRAND
Male Honda Male Toyota
Male Nissan Female Toyota
Male Toyota Male Nissan
Female Nissan Female Honda
Male Honda Male Toyota
Female Nissan Female Mazda
Female Mazda Male Toyota
Female Toyota Female Mazda
Male Nissan Male Honda
Male Toyota Male Mazda

Using GENDER as the row variable and BRAND as the column variable, create a
contingency table. [use Pivot Table function]

4. The customer satisfaction ratings received by a restaurant are classified as Good,


Average or Poor as shown in the table below:
RATING DESCRIPTION
1 to 4 Poor
5 to 7 Average
8 to 10 Good

The satisfaction ratings received from 10 customers of a restaurant are as follows:


9 8 8 7 7 6 5 7 8 3
Use the VLOOKUP function to classify the rating of the 10 customers as Good, Average
or Poor.

EXCEL LAB EXERCISE 1 (ANSWERS)

1.

2.
Row Labels     Count of ITEM
Chicken        7
Fish           5
Vegetable      3
Grand Total    15
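
The pivot table count above can be cross-checked with a short Python sketch. It uses collections.Counter from the standard library; the variable names are illustrative.

```python
from collections import Counter

# Burger types purchased by the 15 customers in Question 2
burgers = [
    "Fish", "Chicken", "Vegetable", "Vegetable", "Chicken",
    "Chicken", "Fish", "Chicken", "Chicken", "Chicken",
    "Fish", "Chicken", "Vegetable", "Fish", "Fish",
]

frequency_table = Counter(burgers)           # count of each category
grand_total = sum(frequency_table.values())

for item in sorted(frequency_table):
    print(f"{item:12} {frequency_table[item]}")
print(f"{'Grand Total':12} {grand_total}")
```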

[Figures: pie chart and bar chart of the frequency table, with categories Chicken, Fish and Vegetable]

3.
Count of BRAND    Column Labels
Row Labels        Honda    Mazda    Nissan    Toyota    Grand Total
Female              1        3        2         2          8
Male                3        1        3         5         12
Grand Total         4        4        5         7         20
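
A cross-tabulation is simply a count of each (row, column) combination. The Python sketch below reproduces the contingency table above from the 20 (gender, brand) pairs in Question 3; the variable names are illustrative.

```python
from collections import Counter

# (gender, brand) for the 20 new car buyers in Question 3
buyers = [
    ("Male", "Honda"), ("Male", "Toyota"), ("Male", "Nissan"), ("Female", "Toyota"),
    ("Male", "Toyota"), ("Male", "Nissan"), ("Female", "Nissan"), ("Female", "Honda"),
    ("Male", "Honda"), ("Male", "Toyota"), ("Female", "Nissan"), ("Female", "Mazda"),
    ("Female", "Mazda"), ("Male", "Toyota"), ("Female", "Toyota"), ("Female", "Mazda"),
    ("Male", "Nissan"), ("Male", "Honda"), ("Male", "Toyota"), ("Male", "Mazda"),
]

cell_counts = Counter(buyers)                      # count of each pair
genders = sorted({gender for gender, _ in buyers})
brands = sorted({brand for _, brand in buyers})

# One row per gender, one column per brand (alphabetical order)
table = {g: [cell_counts[(g, b)] for b in brands] for g in genders}

print("Brands:", brands)
for gender, row in table.items():
    print(gender, row, "Total:", sum(row))
```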

4.
RATING DESCRIPTION
9 Good
8 Good
8 Good
7 Average
7 Average
6 Average
5 Average
7 Average
8 Good
3 Poor
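
The way VLOOKUP classifies these ratings (an approximate match against the lower limit of each class) can be sketched in Python as follows. The function name vlookup_approx is hypothetical; it finds the largest lower limit that does not exceed the look-up value, which is why the lower limits must be in ascending order.

```python
from bisect import bisect_right

# Lower limit of each rating class and its description (Question 4)
lower_limits = [1, 5, 8]
descriptions = ["Poor", "Average", "Good"]

def vlookup_approx(value):
    """Return the description of the class whose lower limit is the
    largest one not exceeding value."""
    index = bisect_right(lower_limits, value) - 1
    if index < 0:
        raise ValueError("value is below the first lower limit")
    return descriptions[index]

ratings = [9, 8, 8, 7, 7, 6, 5, 7, 8, 3]
classified = [vlookup_approx(r) for r in ratings]
for rating, description in zip(ratings, classified):
    print(rating, description)
```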

EXCEL LAB EXERCISE 2
(with step-by-step guide)
Joe has requested that you carry out an analysis of the used car market using the following data:

1. Create the data from the above table using EXCEL.

2. Compute the minimum, maximum, mean, median and standard deviation for the resale price.

3. Create a frequency distribution table, relative frequency and cumulative frequency based on the
resale price. You are required to group the data into appropriate classes using class interval of
$22,000. Use $40,000 as lower limit of the first class.

4. Create a pie chart to show the proportion of vehicles under the different categories.

5. Create a histogram for resale price using the classes obtained in part (3).

6. Create a scatter plot with the horizontal axis as Category variable and the vertical axis as the
resale price variable.

7. Create a contingency table using Category as the row variable and No of Owners as the column
variable.

8. Create a Pie Chart for variable Age with 3 categories namely “NEW” (for cars that are 1 to 3
years old), “AVERAGE” (for cars that are between 4 to 6 years old) and “OLD” (for cars that are
between 7 to 10 years old). [Use VLOOKUP command]

PART 1: CREATING THE DATA

Enter the data in excel as follows:

PART 2: COMPUTING MINIMUM, MAXIMUM, MEAN, MEDIAN AND STANDARD DEVIATION

Assuming that your data for Resale Price is in Cells D2 to D21, you can click on the fx icon to find the
values for the Minimum, Maximum, Mean, Median and Standard Deviation.

Minimum
=MIN(D2:D21)

Maximum
=MAX(D2:D21)

Mean
=AVERAGE(D2:D21)

Median
=MEDIAN(D2:D21)

Standard Deviation
Sample Standard Deviation:
=STDEV.S(D2:D21)
OR
Population Standard Deviation:
=STDEV.P(D2:D21)

PART 3: CREATING A FREQUENCY DISTRIBUTION

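Step 1 : Set up classes

Create the following table starting from cell A33:

[Figure: class table with lower class limits in column A and upper class limits in column C.
Callouts show =A34+21999.99 for the upper limit of the first class and =A34+22000 for the
lower limit of the next class.]

Go to Cell A34 and enter 40000 as your lower class limit.

For the upper class limit, enter the following at cell C34:
=A34+21999.99
The upper limit stops 0.01 short of the next lower limit (=A34+22000) so that a value equal to
a lower limit is counted in the class that starts there. Complete the rest of the table (you
may use the COPY command) and format all numbers to WHOLE numbers.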

Step 2 : Creating a Frequency Column

o Go to cell D33 to create a new column for Frequency.

o Select cells D34:D38 (NOTE: Need to highlight the WHOLE range).

o You will be using the following command syntax:
=FREQUENCY(data_array, bin_array) i.e. =FREQUENCY(D2:D21,C34:C38)

o After the command is executed, it appears in the formula bar as follows:
{=FREQUENCY(D2:D21,C34:C38)}
(Excel adds the curly brackets itself to mark an array formula; do not type them.)

Important: After selecting the data array and bin array for the frequency command, press
CTRL-SHIFT-ENTER to execute the command. (All frequency counts will automatically appear in the
relevant cells.)

Add in the Total Frequency, Relative Frequency, Cumulative Frequency and Mid points

o Create the Total Frequency at Cell D39 by using the following formula:
Example: =SUM(D34:D38)

o You can compute the Relative Frequency by dividing each frequency count by the frequency
total.
Example: Enter the following at Cell E34
=D34/$D$39

o Cumulative Frequency can be computed by adding each frequency count to the previous
cumulative frequency.
At          Enter command
Cell F34    =D34
Cell F35    =F34+D35
Cell F36    =F35+D36
and so on

o Mid points can be computed using (Lower Limit + Upper Limit)/2, e.g. =(A34+C34)/2.
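
The same columns can be sketched in Python to see how each one is built. The class limits follow Part 3 (first lower limit 40000, class interval 22000, five classes), but the frequency counts below are hypothetical because the used car data is not reproduced here.

```python
from itertools import accumulate

# Class limits as set up in Step 1
lower_limits = [40000 + 22000 * i for i in range(5)]
upper_limits = [lower + 21999.99 for lower in lower_limits]

# Hypothetical frequency counts for 20 cars (for illustration only)
frequencies = [6, 7, 4, 2, 1]

total = sum(frequencies)                              # like =SUM(D34:D38)
relative = [f / total for f in frequencies]           # like =D34/$D$39
cumulative = list(accumulate(frequencies))            # running total
midpoints = [(lo + hi) / 2 for lo, hi in zip(lower_limits, upper_limits)]

for row in zip(lower_limits, upper_limits, frequencies, relative, cumulative, midpoints):
    print(row)
```

The last cumulative frequency always equals the total frequency, and the relative frequencies always sum to 1, which gives two quick checks on the worksheet.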

PART 4: Draw a pie chart to show the proportion of vehicles under the different
categories.

Highlight the category column and select Insert Pivot Table.

You will see a Pivot Table builder framework.

First, click and drag Category from the box under the Field name and place it in the Rows area
below. After that, click and drag Category from the box under the Field name again and place it
in the Values area below.

Click to change the Field settings to Count. Select Count when the option box appears.

A frequency table will be generated. Amend the row labels to Category 1, Category 2 and Category 3.

To create a pie chart, select data by highlighting the range A5:B7 then
Select Insert Chart/Pie and choose 3-D Pie tab to get the following chart.

Adding Percentages to the Pie Chart

You can add the percentages to the Pie Chart by placing cursor on your pie chart and right click. Select
Add Data Label. Next right click again and select Format Data Label and check the Percentage option.

PART 5: Create a Histogram for Resale Price

Select the frequency column data at D34:D38, then click Insert chart and select Column chart.

Select any column and Right-click. Select “Format data series”/Options and change gap width to 0.

To change x-axis labeling to class midpoints, Click on the Chart, Select Data and enter G34:G38 for
Category (X) axis labels as shown below:

PART 6: Create a scatter plot using Category variable on the horizontal axis and Resale Price
variable on the vertical axis

Highlight the range of two columns of data where you would want to plot one variable against another.
(Note: DO NOT select the variable names). Click Insert/Chart/Scatter to get the following chart.

You can include axis titles by selecting the Chart Design, Add Chart Elements on the menu as follows:

PART 7: Create a Contingency Table

Select the data in the Category and No of Owners (including the headers). Note : It is alright for the
range to include more than 2 columns e.g. A1:F21. Select Insert Pivot Table to obtain the following:

Click OK and then proceed to use Pivot Table function to generate a contingency table with Category
for the Row and No of Owners for the Column.

Click and drag the field name Category to Row area box below.

After that, click and drag field name No of Owners place it in the Columns area box.

Next drag the Category (you can use No of Owners too) into the Values box and amend the setting to
Count of Category. See chart below.

Change the row labels to Category 1, Category 2 and Category 3. The final output will appear as
follows:

PART 8: Using VLOOKUP command

First, create the following table at Cells J4:M6. The classification to be displayed is in
Column 4 of this table.

Go to Cell G1 to create a new variable called AgeGroup. Now, use the following command at Cell G2:
=VLOOKUP(F2,$J$4:$M$6,4)
where $J$4:$M$6 is the frequency distribution range.

The command allows you to check the Age value in F2 against the lower class value of the classes
from $J$4 to $M$6 and assign the respective Classification Value on Column 4 to G2.

Copy cell G2 to the rest of the rows from G3 to G21.

Use the PIVOT table command to generate a PIE chart (with 3 segments) for AgeGroup.
Please refer to the steps in Part 4 to create the PIE chart.

BUSINESS STATISTICS

SESSION 5

PROBABILITY

At the end of the session, students should be able to:

1. define basic terms used in probability, namely experiment, outcome/event and sample
space.
2. understand mutually exclusive events and independent events.
3. use Venn Diagrams for computing probabilities
4. understand and apply basic addition and multiplication rules and special addition and
multiplication rules in computing probabilities
5. understand and apply conditional probabilities
6. apply Bayes’ theorem and draw Tree Diagrams.
7. compute mean, variance, standard deviation of a Discrete Probability Distribution.
_________________________________________________________________

1. Introduction

A probability is a measure of the chance that an event will happen. Its value ranges from 0 to
1. If the probability of an event is 1, the event will surely happen. If the probability of an event
is 0, the event will never happen.

Probability is everywhere. In a weather forecast, we may ask “What is the chance of rain?” In
investment decisions, we may ask “What is the probability of earning at least 10% on this
investment?” In a volleyball match, a fan may ask “What is the chance of Team A winning
the match?”

If people had perfect information about the future as well as the present and the past, there
would be no need for decision makers to consider the concepts of probability. However, since
we cannot eliminate uncertainty from our lives, we need to recognize its presence and use
probability concepts in the process of making decisions.

2. Definitions of Probability

There are 3 ways to define probability:

Classical Probability
The classical probability rule is applied to compute probabilities of events for an experiment
where all outcomes are equally likely.

Example 5.1
Find the probability of obtaining a Head and the probability of obtaining a Tail for one toss of
a coin.

Solution:
P(Head) = 1/Total number of outcomes = ½
P(Tail) = ½

Empirical Probability
Under the empirical definition, the probability of an event is estimated as the number of times
the event happens divided by the total number of observations. To calculate such probabilities,
we either use past data or generate new data by performing the experiment a large number of
times. The relative frequency of an event is used as an approximation for the probability of
that event.

Example 5.2
Ten of the 500 randomly selected components produced at a certain factory are found to be
defective. What is the probability that the next component manufactured at this factory is
defective?

Solution:
We can list the frequency and relative frequency for this example:

Component    Frequency    Relative frequency
Good         490          490/500 = 0.98
Defective    10           10/500 = 0.02
Total        500          1.00

From the relative frequency column, the probability that a component is defective is 0.02. This
is an approximate probability. However, if the experiment is repeated again and again, the
approximate probability of an event will approach the actual probability. This is called the Law
of Large Numbers.
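
The Law of Large Numbers can be illustrated with a small simulation. The sketch below is not part of the module's Excel workflow; it assumes a fair coin and a fixed random seed (both choices are ours) and shows the relative frequency of Heads settling near 0.5 as the number of tosses grows.

```python
import random


def relative_frequency_of_heads(num_tosses, seed=0):
    """Simulate tosses of a fair coin and return the proportion of Heads."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    heads = sum(1 for _ in range(num_tosses) if rng.random() < 0.5)
    return heads / num_tosses


# Empirical probability from Example 5.2: 10 defective out of 500 components.
p_defective = 10 / 500

for n in (10, 100, 10_000, 100_000):
    print(n, relative_frequency_of_heads(n))
```

The exact figures printed depend on the seed; only the tendency towards 0.5 matters.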

Subjective Probability
Subjective probability is based on whatever information is available. It is based on the
individual’s own judgment, experience, information and belief. A soccer player may assign a
high probability to the chance of the team winning a game whereas the coach may assign a low
probability to the same event.

3. Terms Used in Probability

We examine some key terminologies used in the language of probability.

Experiment
This refers to any procedure or process that yields a result or an observation.

Event
This is the collection of one or more outcomes of an experiment.
- Events are mutually exclusive if the occurrence of any one event means that none of the
other events can occur at the same time.
- Events are independent if the occurrence of one event does not affect the occurrence of
another.
- Events are collectively exhaustive if at least one of the events must occur when an
experiment is conducted.

Outcome
This is the particular result of an experiment, in other words, what actually happens.

Sample Space
This is the set of all possible outcomes for an experiment. The sample space is typically
denoted by S, e.g. S = {1, 2, 3, 4, 5, 6} when we roll a die.

Example 5.3
A white die and a black die are each rolled once and the number of dots showing on each die is
observed. We then add up the total number of dots on both dice. State the experiment, outcome
of interest and the sample space.

Solution:
Experiment                          Each die is rolled once.
Outcome of Interest                 Sum of the number of dots that face up on both dice
Possible Outcomes (Sample Space)    S = {2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12}
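
As an optional check, the sample space of Example 5.3 can be enumerated in Python. The sketch assumes two fair six-sided dice. It also shows that, although the 36 (white, black) pairs are equally likely, the 11 possible sums are not.

```python
from collections import Counter
from itertools import product

faces = range(1, 7)

# All 36 equally likely (white, black) outcomes.
outcomes = list(product(faces, repeat=2))

# Count how many outcomes give each total.
sum_counts = Counter(white + black for white, black in outcomes)

sample_space = sorted(sum_counts)  # the distinct totals
prob_of_sum = {s: count / len(outcomes) for s, count in sum_counts.items()}

print(sample_space)
print(prob_of_sum[7])  # probability that the total is 7
```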

Example 5.4
An experiment consists of drawing one marble from a container that contains a mixture of red,
blue and yellow marbles. State the experiment, outcome of interest and the sample space.
Provide an example of an event.

Solution:
Experiment                          Drawing one marble from a container containing red, blue and yellow marbles
Outcome of Interest                 Colour drawn
Possible Outcomes (Sample Space)    S = {red, blue, yellow}
Event example                       Colour drawn is not blue

4. Venn Diagrams

A Venn diagram is an illustration that utilizes circles, either overlapping or non-overlapping,
to depict a relationship between sets of elements. The sample space is represented by the
rectangle. See Figure 5.1.

Figure 5.1 Venn Diagram

Examples of Venn diagrams depicting various relationships between events are shown below:

Mutually Exclusive Events

Figure 5.2: Venn Diagram – Mutually Exclusive events

The two events A and B are mutually exclusive as they have no common outcomes. Hence,
P(A and B) = 0.

Intersection

Figure 5.3: Venn Diagram – Intersection

The intersection (shaded area in Figure 5.3) of the two events A and B is denoted by A ∩ B. This
means that both events A and B can occur concurrently.

Union

Figure 5.4: Venn Diagram – Union

The union of two events A and B is denoted by A ∪ B. This means that either Event A or Event B
or both have occurred. In Figure 5.4 the events are not mutually exclusive,
i.e. P(A and B) ≠ 0.

Complement

Figure 5.5: Venn Diagram - Complement

The complement of an event A, denoted by ~A, is the event that A does not occur. The
complement rule is used to determine the probability of an event occurring by subtracting
the probability of the event not occurring from 1.

Hence, we have
P(A) + P(~A) = 1 or P(~A) = 1 - P(A)

5. Basic Rules of Probability

Figure 5.6: Probability Rules

5.1 Addition Rule

5.1.1 Non-mutually exclusive events

If A and B are two events that are not mutually exclusive, then P(A or B) is given by the
following formula:
P(A or B) = P(A) + P(B) – P(A and B)

This is known as the General Rule of Addition.

Example 5.5
Students in a school are taking swimming and tennis lessons in the following proportions:

Swimming (S)        : 64%
Tennis (T)          : 35%
Swimming & Tennis   : 20%

(a) Find the probability that a student takes either swimming or tennis lessons.
(b) Draw the Venn diagram.

Solution(a):
P(S or T) = P(S) + P(T) – P(S ∩ T)
= 0.64 + 0.35 – 0.20
= 0.79
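
The general rule of addition is easy to check with exact fractions. The following Python sketch is an optional illustration (the function name is ours); it reproduces Example 5.5 and also works out the share of students in each region of the Venn diagram.

```python
from fractions import Fraction


def prob_a_or_b(p_a, p_b, p_a_and_b):
    """General rule of addition: P(A or B) = P(A) + P(B) - P(A and B)."""
    return p_a + p_b - p_a_and_b


p_swim = Fraction(64, 100)
p_tennis = Fraction(35, 100)
p_both = Fraction(20, 100)

p_swim_or_tennis = prob_a_or_b(p_swim, p_tennis, p_both)

# Regions of the Venn diagram.
p_swim_only = p_swim - p_both
p_tennis_only = p_tennis - p_both
p_neither = 1 - p_swim_or_tennis  # complement rule

print(float(p_swim_or_tennis), float(p_swim_only), float(p_tennis_only), float(p_neither))
```

Fractions are used instead of floating-point numbers so that the arithmetic is exact.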

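Solution(b):
[Venn diagram: two overlapping circles S and T inside the sample space rectangle. Swimming
only = 0.64 – 0.20 = 0.44, both = 0.20, tennis only = 0.35 – 0.20 = 0.15, neither = 1 – 0.79 = 0.21.]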

5.1.2 Mutually exclusive events

If Event A and Event B are mutually exclusive, then P(A ∩ B) = 0 since the events cannot occur
together. In the general rule

P(A or B) = P(A) + P(B) – P(A and B)

the last term equals 0, thus we leave it out of the formula.

Hence, P(A or B) = P(A) + P(B)

This is known as the Special rule of Addition.

Example 5.6
The following data were collected on SQA Airlines about their flights from Destination Y to
Destination Z over the last six months.

Arrival      Event    Frequency
Early        A        200
Late         B        150
On Time      C        1600
Cancelled    D        50
Total                 2000

Find the probability that a flight is either early or late.

Solution:
P(A or B) = P(A) + P(B)
= 200/2000 + 150/2000
= 0.175
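
Example 5.6 can also be worked directly from the frequency table. The sketch below is an optional Python illustration: it converts each frequency into a probability and then applies the special rule of addition, which is valid here because a flight falls into exactly one arrival category.

```python
from fractions import Fraction

# Frequency table from Example 5.6.
flights = {"Early": 200, "Late": 150, "On Time": 1600, "Cancelled": 50}
total_flights = sum(flights.values())

# Empirical probability of each (mutually exclusive) arrival category.
prob = {event: Fraction(count, total_flights) for event, count in flights.items()}

# Special rule of addition: P(A or B) = P(A) + P(B).
p_early_or_late = prob["Early"] + prob["Late"]

print(float(p_early_or_late))
```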

5.2 Multiplication Rule

The probability that events A and B happen together is called the joint probability of A and B
and is written as P(A and B).

5.2.1 Independent events

Two events are independent if the occurrence of one event does not affect the occurrence of the
other. The formula is written as:

P(A and B) = P(A) × P(B)

This is known as the Special rule of Multiplication.

Example 5.7
You throw a coin twice. What is the probability of getting a Head for both throws?

Solution:
P(1st H and 2nd H) = 1/2 × 1/2 = 1/4

Example 5.8
Richard has recently purchased two stocks, Wilma and Gending. The probability that Wilma
stock will increase in value next year is 0.5 and the probability that Gending stock will increase
in value next year is 0.7. Assume that the two stocks are independent.
(a) Find the probability that both stocks increase in value next year.
(b) Find the probability that at least one of these stocks increases in value next year.

Solution(a):
Let W = Wilma increases in value and G = Gending increases in value

P(W and G) = P(W) × P(G)
= 0.5 × 0.7
= 0.35

Solution(b):
P(at least one increases in value)
= 1 – P(neither increases in value)
= 1 – P(~W and ~G)
= 1 – [P(~W) × P(~G)]
= 1 – (0.5 × 0.3)
= 0.85
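
The two parts of Example 5.8 can be cross-checked in Python. This is an optional sketch using exact fractions; it computes part (b) through the complement rule, as in the solution, and confirms it against the general rule of addition.

```python
from fractions import Fraction

p_w = Fraction(1, 2)    # P(Wilma increases)
p_g = Fraction(7, 10)   # P(Gending increases)

# Special rule of multiplication (the stocks are assumed independent).
p_both = p_w * p_g

# Complement rule: at least one increases = 1 - neither increases.
p_neither = (1 - p_w) * (1 - p_g)
p_at_least_one = 1 - p_neither

# Cross-check with the general rule of addition.
p_w_or_g = p_w + p_g - p_both

print(float(p_both), float(p_at_least_one), float(p_w_or_g))
```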

5.2.2 Dependent events

When events are not independent, the joint probability P(A and B) is given by the following
formula:
P(A and B) = P(A) X P(B|A)

This is known as the General rule of Multiplication.

P(B|A) is known as the conditional probability. It is the probability of event B occurring
given that event A has occurred.

Rearranging the multiplication rule for dependent events, if we know the joint probability
P(A and B) and the probability of the given event, then we can calculate the conditional
probability if required:

P(B|A) = P(A and B) / P(A)   and   P(A|B) = P(A and B) / P(B)

Example 5.9
There are five coloured balls in a box namely 3 yellow (Y) balls and 2 red (R) balls. Two balls
are drawn successively without replacement.
(a) What is the probability that both balls are yellow?
(b) What is the probability that the two balls are of different colours?

Solution(a):
P(Y1 and Y2) = 3/5 × 2/4 = 3/10

Solution(b):
P(Different colours) = P(Y1 and R2) + P(R1 and Y2)
= (3/5 × 2/4) + (2/5 × 3/4)
= 3/5
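
One way to convince yourself of the answers in Example 5.9 is to list every possible ordered pair of draws. The Python sketch below is an optional illustration: it treats the five balls as distinct objects, so that each of the 5 × 4 = 20 ordered draws is equally likely, and compares the counts with the general rule of multiplication.

```python
from fractions import Fraction
from itertools import permutations

balls = ["Y", "Y", "Y", "R", "R"]

# Every ordered pair of distinct balls (drawing without replacement).
draws = list(permutations(balls, 2))

both_yellow = sum(1 for first, second in draws if first == "Y" and second == "Y")
different = sum(1 for first, second in draws if first != second)

p_both_yellow = Fraction(both_yellow, len(draws))
p_different = Fraction(different, len(draws))

# Same answer from the general rule of multiplication: P(Y1) x P(Y2|Y1).
p_rule = Fraction(3, 5) * Fraction(2, 4)

print(p_both_yellow, p_different, p_rule)
```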

6. Contingency Table

A contingency table is a table used to classify sample observations according to two categorical
variables.

Example 5.10
The President of a university has proposed that all students take a course in ethics as a
requirement for graduation. Faculty members and students from this university were asked
about their opinion on this issue. The results are presented in the table below:

OPINION         Student (S)    Faculty (F)    Total
Agree (A)       170            110            280
Disagree (D)    120            100            220
Neutral (N)     310            190            500
Total           600            400            1000

(a) Given that a randomly selected person is a faculty member, what is the probability that
the person agrees with the proposal?

Solution(a):
This is conditional probability and can be calculated without using the formula.
Note that there are 400 faculty members out of which 110 agree with the proposal.
Hence, P(A|F) = 110/400 = 0.275

(b) Find the probability that a randomly selected person is a faculty member and agrees
with the proposal.

Solution(b):
This is a joint probability which refers to the intersection of two events. Without the use of the
multiplication rule, we note that the total sample size is 1000 out of which the events “faculty”
and “agree” intersect at the value 110.
Hence, P(F and A) = 110/1000 = 0.11

The alternative solutions for parts (a) and (b) using formulas are shown below:

Solution(a):
P(A|F) = P(A and F) / P(F)
= (110/1000) / (400/1000) = 110/400 = 0.275

Solution(b):
P(F and A) = P(F) × P(A|F)
= 400/1000 × 110/400 = 110/1000 = 0.11

(c) Find the probability that a person selected at random from these 1000 persons is a
faculty member or agrees with the proposal.

Solution(c):
Using the addition rule for events that are not mutually exclusive, we have
P(F or A) = P(F) + P(A) – P(F and A)
= 400/1000 + 280/1000 – 110/1000 = 570/1000
= 0.57

(d) Find the probability that a person selected at random from these 1000 persons is either
neutral about or agrees with the proposal.

Solution(d):
Using the addition rule for mutually exclusive events, we have

P(N or A) = P(N) + P(A)
= 500/1000 + 280/1000 = 780/1000
= 0.78
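
All four parts of Example 5.10 can be read off the contingency table programmatically. The Python sketch below is optional and illustrative; the helper `count_where` is a name of our own, not an Excel or library function.

```python
from fractions import Fraction

# Contingency table from Example 5.10: (opinion, group) -> count.
counts = {
    ("Agree", "Student"): 170, ("Agree", "Faculty"): 110,
    ("Disagree", "Student"): 120, ("Disagree", "Faculty"): 100,
    ("Neutral", "Student"): 310, ("Neutral", "Faculty"): 190,
}
total = sum(counts.values())


def count_where(opinion=None, group=None):
    """Sum the cells matching the given opinion and/or group (None = any)."""
    return sum(
        n for (o, g), n in counts.items()
        if (opinion is None or o == opinion) and (group is None or g == group)
    )


faculty = count_where(group="Faculty")
agree = count_where(opinion="Agree")
neutral = count_where(opinion="Neutral")
faculty_and_agree = count_where("Agree", "Faculty")

p_a_given_f = Fraction(faculty_and_agree, faculty)            # (a) conditional
p_f_and_a = Fraction(faculty_and_agree, total)                # (b) joint
p_f_or_a = Fraction(faculty + agree - faculty_and_agree, total)  # (c) general addition
p_n_or_a = Fraction(neutral + agree, total)                   # (d) special addition

print(float(p_a_given_f), float(p_f_and_a), float(p_f_or_a), float(p_n_or_a))
```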

7. Bayes’ Theorem

Bayes’ Theorem is applied to revise probabilities of events after we obtain more information.
It is computed using the following formula:

P(A1|B) = P(A1)P(B|A1) / [P(A1)P(B|A1) + P(A2)P(B|A2)]

Example 5.11
70% of working people drink coffee and 30% drink tea. Of those who drink coffee, 90% add
sugar, while 40% of those who drink tea add sugar.
% of People % who added sugar (S)
Coffee (C) 70% 90%
Tea (T) 30% 40%
A person added sugar to his drink. What is the probability that the drink is coffee?

Solution:
Given:
P(C) = 0.7    P(S|C) = 0.9
P(T) = 0.3    P(S|T) = 0.4

P(C|S) = P(C)P(S|C) / [P(C)P(S|C) + P(T)P(S|T)]
= (0.7 × 0.9) / [(0.7 × 0.9) + (0.3 × 0.4)]
= 0.63 / 0.75
= 0.84
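
The calculation in Example 5.11 can be wrapped in a small Python function. This is an optional sketch for the two-event form of Bayes' formula used in this session; the function name and argument names are ours.

```python
def bayes_two_events(prior_1, likelihood_1, prior_2, likelihood_2):
    """Return P(A1|B) for two mutually exclusive, exhaustive events A1 and A2.

    prior_i is P(Ai) and likelihood_i is P(B|Ai).
    """
    numerator = prior_1 * likelihood_1
    denominator = numerator + prior_2 * likelihood_2  # this is P(B)
    return numerator / denominator


# Example 5.11: A1 = coffee, A2 = tea, B = sugar added.
p_coffee_given_sugar = bayes_two_events(0.7, 0.9, 0.3, 0.4)
print(round(p_coffee_given_sugar, 2))
```

Swapping the roles of the two events, `bayes_two_events(0.3, 0.4, 0.7, 0.9)` gives P(T|S), and the two revised probabilities add up to 1.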

7.1 How Bayes’ formula is derived

A sample space, S, can be partitioned into 2 mutually exclusive events, A1 and A2.

B is an event in the sample space, S.

Therefore, P(B) = P(A1 ∩ B) + P(A2 ∩ B)

Applying the multiplication rule, we get

P(B) = P(A1) × P(B|A1) + P(A2) × P(B|A2)

The conditional probability formula (see section 5.2.2) is given by:

P(A1|B) = P(A1 and B) / P(B)

Substituting our earlier derivations, we will get

P(A1|B) = P(A1)P(B|A1) / [P(A1)P(B|A1) + P(A2)P(B|A2)]

This is the Bayes formula.

8. Tree Diagrams

A tree diagram is useful for portraying conditional and joint probabilities. It is particularly
useful for analyzing business decisions involving several stages.

Example 5.12
Draw a tree diagram based on the information from Example 5.11

Solution:

[Tree diagram for Example 5.11. The joint probabilities at the ends of the branches add up to 1.00.]

- Circles are known as nodes and the lines are known as branches.
- The probabilities of each group of branches radiating from each node add up to 1.
- The joint probability is obtained by multiplying the simple probability with the
conditional probability.
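
The three properties above can be checked numerically for the tree of Example 5.12. The Python sketch below is an optional illustration. The "No sugar" branch probabilities (0.1 and 0.6) are not stated in Example 5.11; they are derived here from the rule that the branches leaving a node add up to 1.

```python
from fractions import Fraction as F

# First-stage branch -> (simple probability, second-stage conditional branches).
tree = {
    "Coffee": (F(7, 10), {"Sugar": F(9, 10), "No sugar": F(1, 10)}),
    "Tea": (F(3, 10), {"Sugar": F(4, 10), "No sugar": F(6, 10)}),
}

# Joint probability = simple probability x conditional probability.
joint = {
    (drink, sugar): p_drink * p_sugar
    for drink, (p_drink, branches) in tree.items()
    for sugar, p_sugar in branches.items()
}

for path, p in joint.items():
    print(path, float(p))

print("Total:", float(sum(joint.values())))
```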

9. Counting Rules

9.1 Multiplication Formula

The multiplication formula indicates that if there are m ways of doing one thing and n ways of
doing another thing, then there are m X n ways of doing both.

Example 5.13
John takes a sandwich every morning. He can choose white bread or wholemeal bread together
with tuna, chicken or scrambled egg. How many possible ways are there for him to make his
sandwich?
Solution:
2 X 3 = 6 ways

9.2 Permutation

A permutation is any arrangement of r objects selected from n possible objects. The order of
arrangement is important in permutations.

nPr = n! / (n – r)!

where n = total number of objects
      r = number of objects selected

Example 5.14
Find the number of ways to arrange 5 paintings on an exhibition wall. The 5 paintings are
chosen from a set of 7. The order of arrangement of these paintings is important.

Solution:
7P5 = 7! / (7 – 5)! = 2,520

9.3 Combination

A combination is a selection of r objects from n objects without regard to order. The number
of combinations is given by:

nCr = n! / [r!(n – r)!]

where n = total number of objects
      r = number of objects selected

Example 5.15
How many ways are there to choose 3 desserts from a menu of 10 desserts?

Solution:
10C3 = 10! / [3!(10 – 3)!] = 120
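
The three counting rules can be verified with Python's standard `math` module (the functions `math.perm` and `math.comb` require Python 3.8 or later). This is an optional cross-check of Examples 5.13 to 5.15, not part of the module's required tools.

```python
import math

# Multiplication formula (Example 5.13): 2 breads x 3 fillings.
sandwich_ways = 2 * 3

# Permutation (Example 5.14): arrange 5 paintings chosen from 7, order matters.
painting_arrangements = math.perm(7, 5)

# Combination (Example 5.15): choose 3 desserts from 10, order ignored.
dessert_choices = math.comb(10, 3)

# Each combination of r objects can be ordered in r! ways, so nPr = nCr x r!.
relation_holds = math.perm(7, 5) == math.comb(7, 5) * math.factorial(5)

print(sandwich_ways, painting_arrangements, dessert_choices, relation_holds)
```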

10. Discrete Probability Distribution

A random variable X associates a numerical value with each outcome of an experiment. A
random variable is said to be discrete if it can take only a finite or countable number of
distinct values.

A discrete probability distribution is a listing of mutually exclusive and exhaustive outcomes
of a random experiment and the corresponding probabilities.

A discrete random variable is usually denoted by X. The probability distribution of a discrete
random variable X is often denoted by P(x).

10.1 Characteristics of a Discrete Probability Distribution

The characteristics of a discrete probability distribution are:

- Probabilities are between 0 and 1.
- Outcomes are mutually exclusive events.
- The list is exhaustive, hence the sum of the list of probabilities = 1.

Mean (Expected Value) of a discrete random variable
It is the weighted average of all possible outcomes, weighted by their probabilities of occurrence.

μ = Σ[x·P(x)]

Variance of a discrete random variable
It is a weighted average of the squared differences between the values of a random variable and
its mean. The standard deviation, σ, is the square root of the variance.

σ² = Σ[(x – μ)²·P(x)]

Example 5.16
Dan, owner of College Painters, studied his records for the past 20 weeks and reported the
following number of houses painted per week.

Number of houses painted    Weeks
10                          5
11                          6
12                          7
13                          2
Total                       20

(a) Construct the probability distribution.

Solution(a):
Number of houses painted (x) P(x)
10 5/20 = 0.25
11 6/20 = 0.30
12 7/20 = 0.35
13 2/20 = 0.10
Total 1.00

(b) Compute the mean and variance.

Solution(b):

μ = Σ[x·P(x)]
= 10(0.25) + 11(0.3) + 12(0.35) + 13(0.1) = 11.3 houses

σ² = Σ[(x – μ)²·P(x)]
= (0.25)(10 – 11.3)² + (0.3)(11 – 11.3)² + (0.35)(12 – 11.3)² + (0.1)(13 – 11.3)²
= 0.91

(c) Find the probability of x > 10.

Solution(c):
P(x>10) = P(x=11) + P(x=12) + P(x=13)

= 0.3 + 0.35 + 0.1 = 0.75

(d) Find the probability of x < 10.

Solution(d):
P(x<10) = 0 (from the probability distribution, there are no values below 10)
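
Example 5.16 can be reproduced in a few lines of Python. The sketch below is optional and illustrative: it builds the probability distribution from the weekly counts and then applies the mean and variance formulas of section 10.1. The standard deviation is included because it is simply the square root of the variance.

```python
import math

# Number of houses painted per week -> number of weeks observed.
weeks = {10: 5, 11: 6, 12: 7, 13: 2}
total_weeks = sum(weeks.values())

# (a) Probability distribution P(x).
pmf = {x: n / total_weeks for x, n in weeks.items()}

# (b) Mean and variance.
mean = sum(x * p for x, p in pmf.items())
variance = sum((x - mean) ** 2 * p for x, p in pmf.items())
std_dev = math.sqrt(variance)

# (c) and (d) Probabilities of ranges of x.
p_more_than_10 = sum(p for x, p in pmf.items() if x > 10)
p_less_than_10 = sum(p for x, p in pmf.items() if x < 10)

print(round(mean, 2), round(variance, 2), round(std_dev, 4))
print(round(p_more_than_10, 2), p_less_than_10)
```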

11. Discussion questions

1. A sample of employees of WE Co will be surveyed about a new health plan. The
employees are classified as follows:

Classification    Event    Number of Employees
Marketing         A        17
Production        B        46
Finance           C        30
Research          D        7
TOTAL                      100

What is the probability that a person selected is
(a) either in Marketing or in Research?
(b) not in Finance?

2. Routine physical examinations in a company discovered that 8 percent of employees
need corrective shoes, 15 percent need major dental work and 3 percent need both
corrective shoes and major dental work.
(a) What is the probability that an employee selected will need either corrective
shoes or major dental work?
(b) Draw a Venn diagram.

3. An urn has 4 red and 5 blue marbles. 3 marbles are drawn without replacement. Find
the probability that
(a) All marbles are blue.
(b) The first is red and the other two are blue.

4. A car dealer advertised that for $99,999 you can buy a Model X, Y or Z car with a
choice of either leather seats or fabric seats. How many different arrangements of
models and seat types can the dealer offer?

5. There are 12 players in a basketball team. The Coach wants to pick five players
among the twelve.
(a) How many different groups are possible?
(b) Suppose that in addition to selecting the group, he must also rank each of the
players in that starting lineup according to their ability. How many groups are
possible?

6. The soccer team of a junior college plays 70 percent of their games at night and 30
percent during the day. The team wins 50 percent of their night games and 90 percent
of their day games. According to today’s newspapers, they won yesterday. What is the
probability the game was played at night?

7. BBS Clifford Branch has 800 customers. 240 of these customers have housing loans.
Of these 240 customers, 120 customers also own a credit card issued by the bank.
Altogether, 500 customers own a credit card issued by the bank.

(a) Construct a contingency table using the following format:

Housing loan No Housing Loan Total


Credit card
No credit card
Total

(b) Find the probability that a selected customer owns a credit card issued by the
bank.
(c) Given that a selected customer owns a credit card issued by the bank, what is
the probability that the customer also has a housing loan with the bank?
(d) Find the probability that a selected customer does not have a housing loan and
does not own a credit card issued by the bank.

8. ABC bank has three key managers who help manage customers’ investment portfolios.
These managers try to make profits for customers; sometimes losses are incurred. The
data from a sample of 300 customers show the following:

Outcome
Incurred Losses Made Profits TOTAL
Manager A 45 90 135
Manager B 15 110 125
Manager C 10 30 40
TOTAL 70 230 300

(a) Find the probability that a customer selected at random made profits.
(b) Find the probability that a customer selected at random belongs to Manager A
or has incurred losses.
(c) Find the probability that a customer selected at random belongs to Manager C
and has made profits.
(d) Given that the customer selected belongs to Manager C, what is the probability
that the customer made profits?
(e) Given that the customer incurred losses, what is the probability that he/she
belongs to Manager A?

9. The number of cars sold by a salesman on a typical day is given by the following
distribution:
No of cars sold Probability
0 0.5
1 0.3
2 0.2

(a) Compute the mean, variance and standard deviation.
(b) Find the probability that at least 1 car was sold.

12. Supplementary questions

1. Two fair dice are thrown. What is the probability of at least one odd number? What
is the probability of this if four dice are thrown?

2. A university graduate club wants to attract more fresh graduates to join the club. A
survey was conducted amongst fresh graduates recently. 67% indicated that gaming
facilities were influential in their decision to join, 44% felt that dining facilities were
influential, and 21% felt that both were influential factors. Using the results of the
survey as the probabilities for a potential new member, what is the probability that the
potential new member would consider neither gaming nor dining facilities as influential
factors when making a decision to join the club?

3. An insurance company reported the following experience with damage-only claims for
automobile accidents:

Numbers of damage-only claims, by age

            Under 26    26 or older    Total
Claim       1053        1020           2073
No Claim    4139        6087           10226
Total       5192        7107           12299

(a) What is the probability that a person selected at random from this group of
insured made a claim?
(b) What is the probability that a person selected at random is “26 or older” or made
no claim?
(c) What is the probability that a person made a claim given that this person is under
age of 26?

4. Suppose you are eating at Burger King with two friends. You have agreed to the
following rules on who should pay the bill. Each person will toss a coin. The person
who gets a result that is different from the other two will pay the bill. If all three tosses
yield the same result, the bill will be shared by all.

Find the probability that
(a) only you will have to pay the bill.
(b) the bill will be shared among the three of you.

5. Given the following contingency table summarized by a company with 3 suppliers:

Delivery Outcome
Early On Time Late TOTAL
Jones 20 20 10 50
Smith 10 90 50 150
Robinson 0 10 90 100
TOTAL 30 120 150 300

Find the following probabilities associated with a delivery chosen at random:
(a) Being an early delivery.
(b) Being a delivery from Smith.
(c) Being both from Jones and late.

6. Two production lines contribute to the total amount of a company’s products. Line A
provides 30% of the total products, and 15% of Line A’s products are defective. Line B
provides the remaining, and 5% of Line B’s products are defective.

Suppose a defective item was randomly selected from the total products. What is the
probability that this item was produced by Line A?

7. A new sports car model has defective brakes 15 percent of the time and a defective
steering mechanism 5 percent of the time. Let’s assume (and hope) that these problems
occur independently. If one or the other of these problems is present, the car is called a
“lemon”. If both of these problems are present, the car is a “hazard”. Your instructor
purchased one of these cars yesterday.

What is the probability it is
(a) a lemon?
(b) a hazard?

8. Three defective electric shavers were accidentally shipped to a store by Clean-shave
Products along with 17 non-defective ones.

(a) What is the probability the first two shavers will be returned to the store
because they are defective?
(b) What is the probability that the first two shavers will not be defective?

9. Hassan and Wendy appear for an interview for two vacancies in the same post. The
probability of Hassan’s selection is 1/7 and that of Wendy’s selection is 1/5. Find the
probability that
(a) Only one of them will be selected.
(b) Both of them will be selected.
(c) None of them will be selected.
(d) At least one of them will be selected.

10. The table below presents a sample of employees surveyed regarding their loyalty to
XYZ Pte Ltd

                                     Length of Service
Loyalty                 <1 Year,   1-3 Years,   3-6 Years,   6-9 Years,   >9 Years,   Total
                        B1         B2           B3           B4           B5
Would remain, A1        18         11           5            28           12          74
Would not remain, A2    7          10           6            13           10          46
Total                   25         21           11           41           22          120

Using the table above, answer the following questions:

(a) State the probability of selecting an employee with more than 9 years of service.
(b) State the probability of selecting an employee who would not remain with the
company, given that he or she has more than 9 years of service.
(c) State the probability of selecting an employee with more than 9 years of service
or one who would not remain with the company.

11. DBS Bank reports that 66 percent of its customers maintain a checking account, 83
percent of its customers have a savings account, and 53 percent have both. Assuming
that a customer is being chosen randomly, state the probability of selecting a customer
who has either a checking account or a savings account. Also, state the probability of
selecting a customer who has neither a checking account nor a savings account.

12. Daniel takes 5 types of health supplements each morning. He changes the sequence of
consumption each day. How many possible sequences are there for Daniel to consume
these health supplements?

13. You got free tickets to watch the National Day Parade and you can bring along three
friends. However, you have five friends who want to come along. How many groups
of different friends can you take with you?

14. In how many ways can a panel of judges award the 1st, 2nd and 3rd prize among 12 senior
citizens participating in a singing competition?

15. A newly opened cafe provides value set meals at $5 only. Customers can select a main
course (chicken or fish), one type of vegetable (broccoli, spinach or cauliflower), one
side dish (soup, corn or scrambled egg) and one drink (coffee or tea). How many
different meal arrangements are possible?

16. A random variable takes the value 0, 1 and 4 according to the following distribution:

X P(X)
0 0.2
1 0.4
4 0.4
Total 1.00

(a) Compute the mean, variance and standard deviation.
(b) Find the probability that X is not equal to zero.


SESSION 6

USE OF EXCEL FOR DATA ANALYSIS 2

At the end of the session, students should be able to:

1. input collected data in Excel to facilitate data analysis.
2. generate data using formulas and basic functions.
3. apply Excel functions/commands for data analysis.
________________________________________________________________

This is a continuation of Session 4, where we learnt how to use Excel to perform various
data and statistical analyses.

Students will continue to work on Excel Lab Exercise 3.

EXCEL LAB EXERCISE 3

Use EXCEL to complete the following questions:


[Refer back to Session 4 for the steps or commands if required]

1. Health care is an important issue to many people including the government. Researchers
recently conducted a survey of citizens over 60 years of age whose net worth was too
high to qualify for government medical insurance and who have no private health
insurance. The ages of 25 uninsured senior citizens were as follows:

60 61 62 63 64 65 66 68 68 69 70 73 73
74 75 76 76 81 81 82 86 87 89 90 92

(a) Find the mean and the sample standard deviation of the ages of the uninsured
senior citizens.
(b) Set up a frequency distribution (including relative frequency) as shown below:

Age (years) Frequency Relative Frequency


60 up to 65
65 up to 70
70 up to 75
75 up to 80
80 up to 85
85 up to 90
90 up to 95
Total

(c) Present the data from (b) in a histogram.
(d) Describe the shape of the distribution.

2. The following sample of 10 observations of number of heartbeats per minute after an
exercise routine is taken from an infinite population with normal distribution:

75 76 83 91 80 77 84 81 80 73
(a) Find the mean.
(b) Find the median and mode.
(c) Determine the standard deviation.

3. The following shows the net profits ($) of 12 branches of Everfresh Florist Shop on
Mother’s Day.
903 1745 3883 863 1204 1624
1698 957 1041 1138 1354 1802

Determine the:
(a) Mean ______________
(b) Median ______________
(c) Standard deviation _______________

4. Western Digital Media has engaged an independent market consultant to conduct a
survey on whether there is any relationship between the sales of its products and the
advertising expenditures. The data are collected as follows:

Advertising ($000)    Sales ($000)
X                     Y
0.8 22
1.0 28
1.6 22
2.0 26
2.2 34
2.6 18
3.0 30
3.0 38
4.0 30
4.0 40
4.0 50
4.6 46

(a) Identify the independent and dependent variables.
(b) Create a scatter diagram with the independent variable on the horizontal axis
and dependent variable on the vertical axis.
(c) State the direction of the relationship between advertising and sales.

5. The local ice cream shop keeps track of how much ice cream they sell versus the noon
temperature on that day. Here are their figures for the last 12 days:

Temperature (°C)    Sales ($)
X                   Y
13 215
16 325
12 185
15 332
19 406
22 522
19 412
25 614
23 544
18 421
23 445
17 408

(a) Identify the independent and dependent variables.
(b) Create a scatter diagram with the independent variable on the horizontal axis
and dependent variable on the vertical axis.
(c) State the direction of the relationship between temperature and sales. Does the
relationship appear strong?

6. The blood groups of 20 patients are listed below:

Patient No Blood Group


1 A
2 O
3 AB
4 O
5 A
6 AB
7 O
8 B
9 AB
10 A
11 O
12 O
13 B
14 O
15 AB
16 B
17 O
18 B
19 B
20 O

(a) Create a frequency table.
(b) Create a pie chart.
(c) What percentage of patients have blood group type AB?
(d) What percentage of patients have blood group type O?
(e) What percentage of patients have blood group type A or B?

EXCEL LAB EXERCISE 3 (ANSWERS)

1(a) Mean = 74.04
Sample standard deviation = 9.7446
(b)
Age (years) Frequency Relative Frequency
60 up to 65 5 20%
65 up to 70 5 20%
70 up to 75 4 16%
75 up to 80 3 12%
80 up to 85 3 12%
85 up to 90 3 12%
90 up to 95 2 8%
Total 25 100%

(c) [Histogram: Frequency (vertical axis) against Age (years) (horizontal axis) for the seven classes in (b).]

(d) Positively skewed (right-skewed).

2(a) 80
(b) 80, 80
(c) 5.228

3(a) 1517.67
(b) 1279
(c) 819.558

4(a) Independent (X) : Advertising ($000)
Dependent (Y) : Sales ($000)
(b) [Scatter diagram: Advertising vs Sales. Horizontal axis: Advertising ($000); vertical axis: Sales ($000).]

(c) Direct (positive) relationship

5(a) Independent (X) : Temperature
Dependent (Y) : Sales ($)
(b) [Scatter diagram: Temperature vs Sales. Horizontal axis: Temperature (degrees Celsius); vertical axis: Sales ($).]

(c) Positive (direct) relationship. Relationship is strong.

6(a)
Row Labels     Count of Blood Group
A              3
AB             4
B              5
O              8
Grand Total    20

(b) [Pie chart of blood groups: A 15%, AB 20%, B 25%, O 40%.]

(c) 20%
(d) 40%
(e) 40%


SESSION 7

LINEAR REGRESSION AND CORRELATION

At the end of the session, students should be able to:

1. construct and interpret scatter plots of bivariate quantitative variables.
2. identify types of relationships between two quantitative variables.
3. fit a regression equation using least squares method.
4. interpret the slope and y-intercept in the regression equation.
5. calculate and interpret the correlation coefficient.
6. calculate and interpret the coefficient of determination.
7. understand the limitations of linear regression.

_________________________________________________________________

1. Introduction

Research studies involving data rarely focus on a single factor (variable). Most of the time, we
wish to know the relationships or associations between two or more variables. In this topic, our
attention will be focused on the relations between two quantitative variables. In such cases, two
variables will be recorded for each sampling unit at a particular point in time. These paired data
are called ‘bivariate data’.

2. Scatter Diagram

A scatter diagram or scatter plot is helpful in detecting a relationship between two variables. It
shows the paired values of two variables and is constructed by using the horizontal axis as the
independent variable (X) and the vertical axis as the dependent variable (Y).

Ideally, once a scatter diagram is drawn, we try to “best fit” a line through it in such a way that
it indicates to us the type of relationship between the two variables.

Example 7.1
The following table (Table 7.1) shows the number of sales calls made by bank relationship
managers and the number of investment products that they have sold over the past year. Draw
a scatter diagram.

Manager    Number of sales calls    Number of products sold
Ali        150                      70
Joe        100                      40
Maria      50                       60
Lina       50                       30
Sue        150                      40
Herbert    100                      50
Bernard    40                       30
Donald     200                      70
Table 7.1 Sales Data

Solution:
We are examining whether the number of investment products sold (dependent variable) is
affected by the number of sales calls made (independent variable). We draw the dependent
variable (y) on the vertical axis and the independent variable (x) on the horizontal axis.
Thereafter, plot the points for each pair of x and y values.

The final completed scatter diagram is shown in Figure 7.1

[Scatter diagram: No of Sales Calls vs No of Products Sold. Horizontal axis: No of Sales Calls; vertical axis: No of Products Sold.]

Figure 7.1

The diagram indicates a direct relationship between number of sales calls made and the
number of products sold.

2.1 Identifying relationship between variables

Once a scatter diagram is drawn, we look for trends. Is it possible to have a line that best fits
most of the points? Let’s identify the various types of relationships between the variables.

Direct (Positive) versus Inverse (Negative) Relationship

Direct: As x increases, y also increases. For example, more advertising expense leads to higher
sales.

Inverse relationship: As x increases, y decreases and vice versa. For example, when a car gets
older (i.e. age of car increases), the selling price declines.

Linear versus Non-linear Relationship

Linear: A straight line best describes the trend.

Non-linear: A curve or non-linear line best describes the trend.

No Relationship
This occurs when we are unable to identify any obvious trend in the relationship between the
two variables.

Figure 7.2 shows the various relationships between variables. Our focus will be on “linear”
relationships between two variables.

Figure 7.2 Possible Relationships Between Variables (panels: Direct and Linear; Inverse and Linear;
Direct and Non-linear; Inverse and Non-linear; No Relationship)

3. Regression Analysis

In regression analysis, we try to find a line that best fits the points in the scatter diagram. The
least squares method provides us such a line.

Least Squares Principle


The line of best fit is the regression equation for which the sum of the squared residuals (or
errors), i.e. ∑(y − ŷ)², is the smallest. See Figure 7.3.

Figure 7.3 Least Squares Principle


y = actual observed value
ŷ = estimated value
Sum of squares of residuals = ∑(y − ŷ)²

3.1 Least Squares Regression Equation

The estimating equation is given by:


ŷ = a + bx
where
ŷ= predicted dependent variable
x= independent variable
a=Y-intercept (value of y when x=0)
b= slope of the line (change in y for one unit change in x)

The formulas which are used to calculate a and b based on the least squares regression principle
are:
b = (n∑XY − ∑X∑Y) / (n∑X² − (∑X)²)

a = Ȳ − bX̄

where
X = values of the independent variable
Y = values of the dependent variable
n = no of pairs of data values

Example 7.2
AA Research trains research assistants to conduct surveys through face-to-face interviews.
Recently, the firm has undertaken a research project. The figures below show the number of
weeks that these research assistants have worked for the firm and the number of face-to-face
interviews conducted by each research assistant on a given day.

Research Assistant    Experience (weeks)    No of interviews conducted
1                     15                    4
2                     41                    9
3                     58                    12
4                     18                    6
5                     37                    8
6                     52                    10
7                     28                    6
8                     24                    5
9                     45                    10
10                    33                    7

(a) Find the least squares regression equation using weeks of experience as the
independent variable and number of interviews conducted as the dependent variable.

Solution:

We will first compute ∑x, ∑y, ∑x², ∑y² and ∑xy, which will be required when we use
the regression/correlation formulas.

X       Y      X²       Y²      XY
15      4      225      16      60
41      9      1681     81      369
58      12     3364     144     696
18      6      324      36      108
37      8      1369     64      296
52      10     2704     100     520
28      6      784      36      168
24      5      576      25      120
45      10     2025     100     450
33      7      1089     49      231
∑X = 351   ∑Y = 77   ∑X² = 14141   ∑Y² = 651   ∑XY = 3018

b = (n∑XY − ∑X∑Y) / (n∑X² − (∑X)²)
  = (10(3018) − (351)(77)) / (10(14141) − (351)²)
  = 0.173

a = Ȳ − bX̄ = 77/10 − 0.173(351/10) = 1.628

Regression equation: Ŷ = 1.628 + 0.173X
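The calculation above can be checked with a short Python sketch. The function name `least_squares` is a hypothetical helper, not part of the module; it simply applies the two formulas for b and a to the Example 7.2 data. Because it carries b at full precision instead of rounding it to 0.173 first, it gives a of about 1.622 rather than the 1.628 obtained by hand. The difference is purely due to rounding.

```python
def least_squares(xs, ys):
    """Return (a, b) for the line y-hat = a + b*x using the sums-based formulas."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    # Slope: b = (n*SumXY - SumX*SumY) / (n*SumX^2 - (SumX)^2)
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    # Intercept: a = mean(Y) - b * mean(X)
    a = sum_y / n - b * sum_x / n
    return a, b


# Data from Example 7.2
weeks = [15, 41, 58, 18, 37, 52, 28, 24, 45, 33]
interviews = [4, 9, 12, 6, 8, 10, 6, 5, 10, 7]

a, b = least_squares(weeks, interviews)
print(f"b = {b:.3f}, a = {a:.3f}")
```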

(b) Interpret the y-intercept and the gradient.

Solution:

Interpretation of ‘a’
The Y-intercept (i.e. the value of ‘a’) represents the value of Y when X equals zero. With no
work experience, we expect the number of interviews conducted to be about 1.628.

We should, however, be very careful when making this interpretation of a. In our sample of
ten research assistants, the weeks of experience range from 15 to 58. Since x = 0 is outside this
range of experience, the prediction usually will not hold true. This will be explained further
in section 3.2.

Interpretation of ‘b’
The value of b is 0.173. ‘b’ refers to the gradient or slope of the regression line. It gives the
change in y (dependent variable) due to a change of one unit in x (independent variable). In
this example, we can state that on average, a one week increase in work experience will
increase the number of interviews conducted by 0.173 units.

A positive b value means a direct relationship.

3.2 Prediction Using Linear Regression

A regression equation allows us to predict the y value for any given x value.

Interpolation
This uses the regression line, Ŷ = a + bX, to find the estimated value of y for any given value
of x that lies within the range of the data set.

Extrapolation
This uses the regression line, Ŷ = a + bX, to find the predicted value of y for a given value of x
that lies beyond the range of the data set. Extrapolation must be used with caution. This is explained
in Section 5.

Example 7.3
From the regression equation obtained from Example 7.2, predict the number of interviews
conducted by a research assistant with 20 weeks of work experience.

Solution:

The estimating equation is: Ŷ = 1.628 + 0.173X

When x = 20, Ŷ = 1.628 + 0.173(20) = 5.088, i.e. about 5 interviews conducted.
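As an illustration, the prediction in Example 7.3 can be sketched in Python together with a simple check of whether the x value lies within the sampled range (15 to 58 weeks). The function `predict_interviews` and the constant names are hypothetical, chosen only for this sketch.

```python
A, B = 1.628, 0.173        # intercept and slope from Example 7.2
X_MIN, X_MAX = 15, 58      # smallest and largest weeks of experience in the sample


def predict_interviews(weeks):
    """Return (predicted interviews, True if weeks lies within the sample range)."""
    within_range = X_MIN <= weeks <= X_MAX
    return A + B * weeks, within_range


y_hat, is_interpolation = predict_interviews(20)
print(f"{y_hat:.3f} interviews, interpolation: {is_interpolation}")

# x = 0 lies outside 15..58, so using the line there is extrapolation.
_, zero_ok = predict_interviews(0)
print(f"x = 0 within range: {zero_ok}")
```

Returning the range flag alongside the prediction keeps the caution about extrapolation visible wherever the prediction is used.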

4. Correlation Analysis

4.1 Correlation Coefficient, r

The correlation coefficient, r, is a number that indicates the direction and strength of the
linear relationship between an independent variable (X) and a dependent variable (Y). In other
words, r measures how closely the points in a scatter diagram are spread around the regression line.

The correlation coefficient is calculated using the following formula:

r = (n∑XY − ∑X∑Y) / √([n∑X² − (∑X)²][n∑Y² − (∑Y)²])

The value of r always lies in the range −1 to +1, that is, −1 ≤ r ≤ 1.

In a direct relationship where y increases as x increases, b is positive and r will be positive. In
an inverse relationship where b is negative, r will be negative. If r = 1, it is said to be a case of
perfect, positive, linear correlation. If r = −1, it is said to be a case of perfect, negative,
linear correlation. See Figure 7.4 for varying correlation between variables.

Figure 7.4 Linear correlation between variables (panels: Strong, positive linear correlation, r is
close to 1; Weak, positive linear correlation, r is positive and close to about 0.5; Strong,
negative linear correlation, r is close to −1; Weak, negative linear correlation, r is negative
and close to about −0.5)

Example 7.4
Based on the data from example 7.2, compute the correlation coefficient.

Solution:

r = (n∑XY − ∑X∑Y) / √([n∑X² − (∑X)²][n∑Y² − (∑Y)²])

  = (10(3018) − (351)(77)) / √([10(14141) − (351)²][10(651) − (77)²])

  = 0.969

Interpretation of ‘r’
The correlation coefficient is 0.969. Since the value is close to 1, it indicates a strong,
positive correlation.
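The same value can be reproduced with a few lines of Python. This is a minimal sketch that applies the formula for r directly to the Example 7.2 data; the function name `correlation` is a hypothetical helper.

```python
from math import sqrt


def correlation(xs, ys):
    """Correlation coefficient r computed from the sums-based formula."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator


# Data from Example 7.2
weeks = [15, 41, 58, 18, 37, 52, 28, 24, 45, 33]
interviews = [4, 9, 12, 6, 8, 10, 6, 5, 10, 7]

r = correlation(weeks, interviews)
print(f"r = {r:.3f}")
```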

4.2 Coefficient of Determination, r²

How good is a regression model? In other words, “How well does the independent variable
explain the dependent variable in the regression model”. The coefficient of determination
answers this question.

The coefficient of determination, denoted by r², represents the proportion of variation in the
dependent variable (y) that is explained by the variation in the independent variable (x). The
value of r² ranges from 0 to 1.

Example 7.5
Further to Example 7.4, compute the coefficient of determination and interpret the value.

Solution:

r² = (0.969)² = 0.939
Interpretation of ‘r²’
About 93.9% of the variation in the number of interviews conducted (the dependent variable)
is explained by the variation in weeks of experience (the independent variable).
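To see why r² can be read as a "proportion of variation explained", the sketch below splits the variation in y for the Example 7.2 data into a total part and an unexplained (residual) part. The names `sst` and `sse` are labels introduced for this sketch and do not appear in the module; the identity r² = 1 − SSE/SST holds for a least squares line with an intercept. At full precision the result is about 0.9397; the hand calculation gives 0.939 because it squares the rounded value r = 0.969.

```python
# Data from Example 7.2
weeks = [15, 41, 58, 18, 37, 52, 28, 24, 45, 33]
interviews = [4, 9, 12, 6, 8, 10, 6, 5, 10, 7]

n = len(weeks)
mean_x = sum(weeks) / n
mean_y = sum(interviews) / n

# Least squares slope and intercept (deviation form, equivalent to the sums formulas)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(weeks, interviews))
sxx = sum((x - mean_x) ** 2 for x in weeks)
b = sxy / sxx
a = mean_y - b * mean_x

# Total variation in y, and the variation left unexplained by the line
sst = sum((y - mean_y) ** 2 for y in interviews)
sse = sum((y - (a + b * x)) ** 2 for x, y in zip(weeks, interviews))

r_squared = 1 - sse / sst
print(f"r squared = {r_squared:.4f}")
```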

5. Cautionary Notes

We should apply linear regression with caution. Here are some points to watch out for:

Non-linear relationships
We have only seen how to use a straight line to model the best fit. Sometimes, the relationship
may not be linear. Hence, it is good to construct a scatter diagram and look at the plot before
we use simple linear regression.

Extrapolation
A linear regression equation is established from the set of data collected. If you use the
estimated regression line for prediction with values that lie outside the range of the original
data, the estimates may be inaccurate. This is known as extrapolation.

For example, in our example on weeks of experience and number of interviews conducted, the
values of x range from 15 to 58 weeks. Hence, our estimated regression line is only applicable
for values of x falling within this range. If we predict y for a value of x either less than 15
or greater than 58, it is called extrapolation. We should interpret such predictions cautiously
and not attach much value to them.

Association does not imply Causation


A strong correlation simply means that the movement of one variable strongly follows the
movement of another variable. This, however, does not necessarily mean that one variable
causes the movement of the other variable. There could be other factors influencing both
variables at the same time.

For example, data in a particular country showed an increase in car sales as well as an increase
in the sale of new homes. One does not cause the other; the cause is probably higher incomes,
a third variable that is not included in the study.

Units of measurement
Be careful about the units of measurement used to obtain the regression equation e.g. 000s or
millions. For example, if X is the advertising expenditure and the value of the original x data
is in 000s, then a value of $10,000 would mean x =10 and not 10,000.

6. Performing Regression/Correlation Analysis using Excel

Input the following data (Figure 7.5) showing credit card limit ($000) and average monthly
spending($) of customers.

Figure 7.5

Next click Tools -> Data Analysis -> Regression

Figure 7.6

Select the input range for the dependent variable (Y) and independent variable (X):

Figure 7.7

The following Excel output (Figure 7.8) will be generated:

Figure 7.8 Regression Excel Output (callouts mark the Correlation coefficient (r), the Coefficient
of determination (r²), the Y-intercept (a) and the Slope or Gradient (b))

From the output, we can obtain the following information:

• Correlation coefficient (r) = 0.756 (note: this value is displayed without the direction sign).
It will be positive if the slope (b) is positive. If the slope (b) is negative, then r will be
negative. In this case, r is positive.

• Coefficient of determination (r²) = 0.572

• Y-intercept (a) = 475.734

• Gradient or Slope (b) = 17.408

• Regression Equation: Ŷ = 475.734 + 17.408X

7. Discussion questions

1. The sales($mil) and advertising ($mil) data were collected from XYZ Co for a sample
of 4 months.

Month    Advertising $m (X)    Sales $m (Y)    XY    X²    Y²
Jul      2                     7
Aug      1                     3
Sep      3                     8
Oct      4                     10
Total    10                    28

(a) Using Advertising Expense as the independent variable and Sales as the dependent
variable, draw the scatter diagram. Does the diagram indicate a relationship between
the two variables?

(b) Compute the gradient “b”

(c) Compute the y-intercept “a”

(d) What is the regression equation?

(e) Interpret the values of the gradient “b” and the y-intercept “a”.

(f) Compute the correlation coefficient “r”. Interpret.

(g) Compute the coefficient of determination “r²”. Interpret.

(h) Predict the sales value if advertising expense is $3 million.

2. The manufacturer of Home Exercise Machine is studying the relationship between the
number of months since the machine was purchased and the usage time in the last
week. A phone survey produced the following results:

Customer ID Months from purchased date (X) Hours used last week (Y)
891 4 7
832 6 0
621 9 1
319 9 2
756 2 6
753 6 0
669 7 2
900 4 7
764 6 4
428 6 0

A regression analysis produced the following partial output:

Regression Statistics
Multiple R           ?
R Square             0.4518
Adjusted R Square    0.3832
Standard Error       2.2656
Observations         10

                              Coefficients    Std Error    t Stat    P-value
Intercept                     8.14            2.16         3.76      0.005518
Months from purchased date    -0.89           0.35         -2.57     0.033255

(a) From the statistical output given or otherwise, determine the coefficient of
determination and interpret the result.

(b) Compute the coefficient of correlation (use 3 dec pl) and interpret the result.

(c) Write the regression equation and interpret the meaning of the slope.

(d) Use the regression equation to predict the number of hours used in the last one week for
a customer that has bought the machine 5 months ago.

(e) If a customer has bought a machine 24 months ago, can we still use the regression
equation to predict the number of hours used in the last one week by the customer? Why
or why not?

3. The following table and chart show a sample of 10 companies in the restaurant
industry with their respective number of employees and annual profits.

Number of employees    Annual Profit ($)
1320                   11,880,000
721                    6,489,000
667                    5,336,000
902                    9,020,000
753                    7,530,000
1396                   11,168,000
1219                   12,190,000
727                    6,543,000
675                    5,400,000
609                    4,872,000

(a) Describe the relationship between the two variables from the scatter plot.

(b) Write down the regression equation using the statistical output. (use 3 dec pl)

(c) What is the coefficient of determination? Interpret.

(d) Name two other quantitative variables that may have a relationship with the
annual profits of a firm.

(e) What is the correlation coefficient? Interpret.

(f) Explain how annual profit changes for every additional employee.

(g) If a firm has 800 employees what is the expected amount of annual profit?

(h) If the firm has 3000 employees, can we still use the regression equation to
predict the annual profit? Explain.

(i) What is the expected change in annual profits if a firm reduces the number of
employees by 200?

(j) A firm has annual profits of $8 million, how many employees would you expect
this firm to have? (Round answer to nearest whole number)

8. Supplementary questions

1. Food Shop employs several sales representatives who call retail grocery outlets for the
purpose of merchandising the company’s food products. Sarah, the sales director
wishes to determine the relationship between the number of calls a sales representative
makes to a given retail outlet and the amount of the company’s food products purchased
by the outlet. She selected 5 retail outlets at random and obtained the following
information:

Retail Grocery Outlet    Number of sales calls during one month    Monthly Sales to Outlet ($000)
A                        7                                         73
B                        6                                         68
C                        5                                         60
D                        3                                         45
E                        4                                         54

(a) Determine the linear regression equation using the least squares method, with the number
of sales calls as the independent variable.

(b) Interpret the value of the gradient “b”.

(c) Henry, a sales representative, makes 6 sales calls to an outlet (known as Outlet K) during
the month. Estimate the monthly sales to Outlet K.

(d) Calculate the correlation coefficient and explain what it means.

2. The table below shows the data on incomes and food expenditures of seven households.

Income ($00)    Food Expenditure ($00)
35              9
49              15
21              7
39              11
15              5
28              8
25              9

Given: ∑x = 212   ∑y = 64   ∑x² = 7222   ∑y² = 646   ∑xy = 2150

(a) Draw and label the scatter diagram of these data.

(b) Comment on the scatter diagram.

(c) Calculate the correlation coefficient. Interpret your findings.

(d) Calculate the least squares regression equation.

(e) Interpret the gradient.

(f) Predict the food expenditure for a household with income of $3,100.

(g) Would you use the least squares line in part (d) to predict the food expenditures of a
household with income of $5,500? Justify your answer.

3. For a particular car brand, you wish to study the relationship between the age of a car
and its selling price. Listed below is a random sample of 12 used cars sold during the
last year.
Car Age (X) Selling Price in $000 (Y)
1 9 8.1
2 7 6.0
3 11 3.6
4 12 4.0
5 8 5.0
6 7 10.0
7 8 7.6
8 11 8.0
9 10 8.0
10 12 6.0
11 6 8.6
12 6 8.0

Given: ∑x = 107   ∑y = 82.9   ∑x² = 1009   ∑y² = 615.29   ∑xy = 712.9

(a) If we want to estimate selling price on the basis of the age of the car, which variable is
the dependent variable and which is the independent variable?

(b) Draw a Scatter Diagram.

(c) Determine and interpret the coefficient of correlation (r).

(d) Determine and interpret the coefficient of determination (r²).

(e) Determine the regression equation.

(f) Estimate the selling price of a 10-year-old car.

(g) Interpret the gradient.

4. The management of Hello Electronics wants to investigate the relationship between the
years of experience and the number of units of Product M assembled by its employees
working in the assembly department. The management took a sample of seven employees
from the assembly department and observed them for a week. The following table gives
data on the years of experience for these employees and the number of units of Product
M each of them assembled per day.

Experience 5 11 15 7 2 10 9
No of units assembled 14 21 20 18 13 16 18
Given: ∑x = 59   ∑y = 120   ∑x² = 605   ∑y² = 2110   ∑xy = 1075

(a) Find the least squares regression equation with experience as the independent variable
and units assembled as the dependent variable.

(b) Give a brief interpretation of the values of “a” and “b” calculated in part (a).

(c) Predict the number of units of Product M assembled by an employee with 12 years of
experience.

(d) Estimate the number of units of Product M assembled per day by a worker with 25 years
of experience. Comment on this finding.

(e) Calculate the correlation coefficient. Explain what it means.

5. A researcher would like to investigate the relationship between flight hours on a nonstop
trip and the one-way airfare being charged for economy class in short-haul flights within
Asia on a non-peak weekday.

Data are gathered as shown in the table below:


Flight hours Airfare ($)
1.50 137
1.50 280
1.08 113
5.75 440
6.30 571
5.50 558
7.25 899
7.08 1393
4.00 552
4.25 662

A regression analysis produced the following output:

Regression Statistics
Multiple R           0.817506136
R Square             0.668316283
Adjusted R Square    0.626855818
Standard Error       231.3501891
Observations         10

                Coefficients    Standard Error    t Stat         P-value
Intercept       -20.4233154     162.136052        -0.12596406    0.90286852
Flight hours    131.400885      32.7283682        4.01489266     0.00386858

(a) State the independent and dependent variables.

(b) Write the regression equation. (Express the coefficients in equation to 3 decimal places)

(c) Determine the correlation coefficient and interpret the result. (Express your answer to
3 decimal places)

(d) Determine the coefficient of determination and interpret the result. (Express your
answer to 3 decimal places)

(e) Using the regression equation, estimate the expected airfare for a flight that takes 5
hours. (Express your answer to 2 decimal places)

(f) What is the expected change in the airfare for a flight that takes 2 hours longer?
(Express your answer to 2 decimal places)

(g) Based on the information above, can we estimate the airfare for flight from Singapore
to London that takes at least 13 hours? Explain.

6. The Traffic Police issues demerit points to motorists who have committed traffic
violations on the road so as to identify high-risk motorists or habitual traffic offenders.
A study was carried out to investigate whether years of driving experience affects the
number of demerit points chalked up by motorists over the last two years.

The results from a sample of 10 motorists are shown below:


Years of driving experience Number of demerit points
2 12
3 14
19 8
30 4
25 2
15 4
12 4
8 12
35 0
34 6

A regression analysis produced the following output:


Regression Statistics
Multiple R 0.787615727
R Square 0.620338533
Adjusted R Square 0.572880849
Standard Error 3.083913059
Observations 10
Coefficients Standard Error t Stat P-value
Intercept 12.12447768 1.812707541 6.688601114 0.000154513
Years of driving experience -0.30188402 0.083498567 -3.61543961 0.00682759

(a) State the independent and dependent variables.

(b) Set up a scatter diagram and label the axes clearly. Does the diagram indicate a
relationship between the 2 variables?

(c) Given that the linear regression equation is Ŷ = 12.124 − 0.302X, state and interpret
the value of the slope.

(d) Determine the coefficient of correlation and interpret the value. (Express your answer
to 3 decimal places).

(e) Kareem has 9 years of driving experience while his father has 29 years of driving
experience. How many more or less demerit points would you expect Kareem to have
over the last two years? (Express answer as a whole number)

(f) Mr Gan has chalked up 11 demerit points over the last two years. How many years of
driving experience would you expect Mr Gan to have? (Express answer as a whole
number)

BUSINESS STATISTICS

SESSION 8

NORMAL DISTRIBUTIONS AND SAMPLING DISTRIBUTIONS

At the end of the session, students should be able to:

1. compute the areas / probabilities for a normally distributed variable.
2. understand the sampling distribution of sample means and its applications.
3. understand the Central Limit Theorem.

___________________________________________________________________________

1. Introduction

The normal distribution is the most important and most widely used of all the probability
distributions. A large number of phenomena in the real world tend to be normally distributed
or are approximately normally distributed. Continuous variables like height, weight, scores in
an examination, lifespan of an electronic item etc. usually follow an approximately normal
distribution.

2. Characteristics of a normal distribution

Figure 8.1 The Normal Distribution

The normal probability distribution, when plotted, gives a bell-shaped curve such that

• the curve is symmetric around the mean.
• the distribution has a single peak. Mean, median & mode are all identical.
• the area under the curve = 1 or 100%. Hence, ½ of the total area lies on the left side of the
mean and ½ lies on the right side.
• the two tails of the curve extend indefinitely and never reach the horizontal axis
(asymptotic).
• two parameters, namely the mean (µ) and the standard deviation (σ), are needed to specify any
normal distribution.

Normal distribution curves can have the same mean but different standard deviations (see
Figure 8.2) or different means but the same standard deviations (see Figure 8.3). A larger
standard deviation (σ) results in a wider and flatter normal curve, which indicates more
variability or dispersion among the data.

Figure 8.2 Normal Distributions with same mean but different standard deviations

Figure 8.3 Normal Distributions with different means but same standard deviations

3. The Standard Normal Distribution

Assume we have a random variable, X, which is normally distributed with mean µ and
standard deviation σ. If we want to find the probability of X lying between two values a and b,
we find the area under the curve between a and b, which gives us P(a ≤ X ≤ b) (see Figure 8.4).

Figure 8.4 Area under the normal curve

The calculation of the probability (or area) involves more complex calculus and probability
density functions. Here, we will see how to find this probability – that is, the area under normal
curve, using a statistical table called the standard normal table.

Since a normal curve is specified by the mean (µ) and the standard deviation (σ), there would
be a different normal curve for each possible pair of µ and σ. That is, we would need one
normal table for each normal curve to find the probabilities. To overcome the problem of
needing a limitless number of normal distribution tables, we can “standardize” the normal
curve by expressing the original values of a normal random variable in terms of the ‘number
of standard deviations away from the mean’.

With this approach, we can use one standard normal table for all normal curves. We refer to
this ‘one’ normal curve as the standard normal distribution. The standard normal distribution
has a mean of 0 and a standard deviation of 1.

Figure 8.5 displays the standard normal distribution curve. The units for the standard normal
distribution curve are denoted by z. We can call these units z values or z scores.

Figure 8.5 The standard normal distribution curve

Note that the values of z on the left side of the mean are negative. However, the probability or
the area under the curve is always positive. A point with a z-value of 2 means that the point is
two standard deviations to the right of the mean. Similarly, a point with a z-value of -2 means
that the point is two standard deviations to the left of the mean.

How do we obtain the z-value for varying values of x? We note that all intervals containing
the same number of standard deviations from the mean will contain the same proportion of the
total area under the curve for any normal random variable, X.

We convert an x-value to a z-value using the following formula:


z = (x − µ)/σ

where
z = number of standard deviations from x to µ
x = value of the random variable
µ = mean of the distribution of this random variable
σ = standard deviation of this distribution

Example 8.1
The monthly incomes of security officers follow the normal distribution with a mean of $1,500
and a standard deviation of $300.
(a) What is the z value of an officer who earns $1,200 per month?
(b) What is the z value of an officer who earns $1,900 per month?

Solution:
(a) For x = 1200, z = (x − µ)/σ = (1200 − 1500)/300 = −1

(b) For x = 1900, z = (x − µ)/σ = (1900 − 1500)/300 = 1.33
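The conversion from x to z is a one-line calculation. A minimal Python sketch, using a hypothetical helper named `z_score`, reproduces both answers in Example 8.1:

```python
def z_score(x, mu, sigma):
    """Number of standard deviations that x lies from the mean mu."""
    return (x - mu) / sigma


# Monthly incomes: mean $1,500, standard deviation $300
z_a = z_score(1200, 1500, 300)   # officer earning $1,200
z_b = z_score(1900, 1500, 300)   # officer earning $1,900
print(z_a, round(z_b, 2))
```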

3.1 Finding Areas under the Normal Curve

As noted earlier, a z-value shows the distance between a particular value of x and the µ in terms
of number of standard deviations from the mean. The table in Appendix 2 (The Standard
Normal Table) lists the areas or probabilities for this standardized distribution. Figure 8.6
shows a portion of these probabilities.

Figure 8.6 Areas under the normal curve

We shall now apply the standard normal distribution to find the area (probability) in a normal
distribution for varying values of x.

Example 8.2
The monthly incomes of security officers follow the normal distribution with a mean of $1,500
and a standard deviation of $300. What is the probability that an officer earns
(a) between $1,500 and $1,800 per month?
(b) more than $1,200 per month?
(c) less than $1,200 per month?
(d) between $1,800 and $2,000 per month?

Solution (a):
P(1500 ≤ x ≤ 1800) = P(0 ≤ z ≤ (1800 − 1500)/300)
= P(0 ≤ z ≤ 1.00)

= 0.3413

The probability associated with a z of 1.00 is available from Appendix 2. To locate the
probability, go down the left column to 1.0 and then move horizontally to the column heading
0.00 (see Figure 8.7)

Figure 8.7 Reading z-value and corresponding probability

Solution (b):
P(x > 1200) = P(z > (1200 − 1500)/300)
= P(z > −1.00)

= 0.3413 + 0.5

= 0.8413

Recall that half the area of a normal curve is above the mean. So, the probability of selecting
an officer earning above $1,200 is obtained by adding two areas, that is, 0.3413 + 0.5.

Solution (c):
P(x < 1200) = P(z < (1200 − 1500)/300)
= P(z < −1.00)

= 0.5 − 0.3413

= 0.1587

Since half the area under the curve is 0.5 and the area between $1,200 and $1,500 is 0.3413,
the probability of x being less than $1,200 is (0.5 – 0.3413).

Solution (d):
P(1800 ≤ x ≤ 2000) = P((1800 − 1500)/300 ≤ z ≤ (2000 − 1500)/300)

= P(1.00 ≤ z ≤ 1.67)

= 0.4525 − 0.3413

= 0.1112

The situation is again separated into two parts. The probability of salaries lying between $1,500
and $1,800 is 0.3413. The probability of salaries lying between $1,500 and $2,000 is 0.4525.
Thus, the probability of salaries lying between $1,800 and $2,000 is (0.4525 − 0.3413).
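The four table look-ups in Example 8.2 can also be cross-checked in software. The sketch below uses `NormalDist` from Python's standard `statistics` module (available from Python 3.8), whose `cdf` method gives the area to the left of a value. This is an alternative to the Appendix 2 table, not the method used in this module. Part (d) comes out at about 0.1108 rather than 0.1112 because the table method rounds z to 1.67.

```python
from statistics import NormalDist

# Monthly incomes: mean $1,500, standard deviation $300
income = NormalDist(mu=1500, sigma=300)

p_a = income.cdf(1800) - income.cdf(1500)   # (a) between $1,500 and $1,800
p_b = 1 - income.cdf(1200)                  # (b) more than $1,200
p_c = income.cdf(1200)                      # (c) less than $1,200
p_d = income.cdf(2000) - income.cdf(1800)   # (d) between $1,800 and $2,000

for label, p in zip("abcd", (p_a, p_b, p_c, p_d)):
    print(f"({label}) {p:.4f}")
```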

3.2 Finding z and x values with a given area

We will now do a reverse procedure where we find the corresponding value of z or x when an
area under a normal curve is known.

Example 8.3
Find the value of z such that the area under the curve between 0 and z is 0.3888.

Solution:

To obtain the z value, we have to locate 0.3888 in the body of the standard normal table. Then
we read the numbers in the z column and the header to obtain the z value of 1.22 (see Figure
8.8).

Figure 8.8 Inverse use of table

Example 8.4
Applicants for a particular job are required to sit for an aptitude test. The test scores are
normally distributed with a mean of 40 and a standard deviation of 7. Hilary is going to sit for
this test soon. What should her score be so that only 15% of all who sit for this test score higher
than she does?

Solution :
Let x represent the test scores of all job applicants. We wish to find the value of x such that the
area under the curve to the right of x is 15%.

The area between µ and x is (0.5 -0.15) = 0.35


To find the z value corresponding to the x value, we look for 0.3500 in the body of the standard
normal table (Appendix 2). The value closest to 0.3500 is 0.3508 which corresponds to a z
value of 1.04. Hence, we have
z = (x − µ)/σ
1.04 = (x − 40)/7
x = 47.28

If Hilary scores 47.28 on the test, only about 15% of job applicants are expected to score higher
than she does.
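The reverse look-ups in Examples 8.3 and 8.4 can be sketched with the `inv_cdf` method of `NormalDist` from Python's standard `statistics` module (Python 3.8 or later), which returns the value with a given area to its left. Since the table gives the area between 0 and z, 0.5 is added first. This is only a cross-check under those assumptions; the software answer for Example 8.4 is about 47.26, slightly below 47.28, because the table method uses the nearest tabulated z value of 1.04.

```python
from statistics import NormalDist

# Example 8.3: area between 0 and z is 0.3888, so area to the left of z is 0.5 + 0.3888
z = NormalDist().inv_cdf(0.5 + 0.3888)
print(f"z = {z:.2f}")

# Example 8.4: test scores with mean 40 and standard deviation 7;
# 15% score higher, so 85% of the area lies to the left of the cut-off score
scores = NormalDist(mu=40, sigma=7)
cutoff = scores.inv_cdf(1 - 0.15)
print(f"cut-off score = {cutoff:.2f}")
```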

4. The Sampling Distribution

For any population data set, there is only one value of the population mean, µ. However, when
we deal with samples, we would expect different samples of the same size drawn from the same
population to yield different values of the sample mean, x̄. Like any other random variable,
the sample mean, x̄, possesses a probability distribution, which is called the sampling
distribution of x̄.

Hence, we have this definition: “The sampling distribution is the probability distribution of a
sample statistic (here, the sample mean).” It is obtained by drawing all possible samples
of a given size from the population and computing the sample statistic for each.

4.1 Mean and Standard Deviation of x̄

The mean and standard deviation of the sampling distribution of x̄ are denoted by µ_x̄ and σ_x̄.
The standard deviation of x̄ is known as the standard error (σ_x̄).

If we take all possible samples (of the same size) from a population and calculate their means,
we will find that the mean of these sample means (µ_x̄) is always equal to the population mean, µ.
This can be illustrated with the simple example below:

Example 8.5
The following data give the years of experience for all four employees of a small company.
The random variable, X = Years of experience.

Employee X
Mark 1
Frank 1
Dawn 3
Sue 5

We can now compute the population mean (µ).


µ = (1 + 1 + 3 + 5)/4 = 2.5 years

We now list all the possible samples of size 2 (n=2) from this population.

Sample          Sample mean (x̄)
Mark, Frank     x̄ = (1 + 1)/2 = 1
Mark, Dawn      x̄ = (1 + 3)/2 = 2
Mark, Sue       x̄ = (1 + 5)/2 = 3
Frank, Dawn     x̄ = (1 + 3)/2 = 2
Frank, Sue      x̄ = (1 + 5)/2 = 3
Dawn, Sue       x̄ = (3 + 5)/2 = 4

We shall calculate the mean for this sampling distribution of means, that is, we obtain
an average of all the sample means.
µ_x̄ = (1 + 2 + 3 + 2 + 3 + 4)/6 = 2.5, which is exactly equal to µ.
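The enumeration in Example 8.5 can be reproduced with a short Python sketch that lists every sample of size 2 and averages the sample means. Only standard-library functions are used; the variable names are chosen for this illustration.

```python
from itertools import combinations
from statistics import mean

# Years of experience for all four employees (the whole population)
experience = {"Mark": 1, "Frank": 1, "Dawn": 3, "Sue": 5}
values = list(experience.values())

population_mean = mean(values)

# Every possible sample of size 2, and the mean of each sample
sample_means = [mean(pair) for pair in combinations(values, 2)]
mean_of_sample_means = mean(sample_means)

print(sample_means)
print(population_mean, mean_of_sample_means)
```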

The standard deviation of the sampling distribution is given by the formula below:

σ_x̄ = σ/√n

We call this the Standard Error.

4.2 Shape of the Sampling Distribution

If the original population is normally distributed, the sampling distribution of sample means
will also be normal, whatever the value of the sample size (n).

However, if the original population is NOT normally distributed, then the sampling distribution
of sample means will be approximately normal only for large sample sizes (n ≥ 30). The
distribution approaches the normal distribution as the sample size n increases. This is known as
the Central Limit Theorem. (See Figure 8.9)

Figure 8.9 Shape of sampling distribution

Central Limit Theorem

According to the central limit theorem, for a large sample size, the sampling distribution
of x̄ will be approximately normal, irrespective of the shape of the population distribution.
The mean and standard deviation of the sampling distribution of x̄ are

µ_x̄ = µ and σ_x̄ = σ/√n

The sampling distribution of x̄ will approach normality as n increases. The sample size
is usually considered to be large if n ≥ 30.

Why is the Central Limit Theorem important?


The Central Limit Theorem (CLT) is the basis of statistical inference because it permits us to
draw conclusions about the population from sample data without knowing the shape of the
underlying population distribution. The only requirement is a sufficiently large sample size.
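
The behaviour described by the Central Limit Theorem can be observed with a small simulation. The sketch below is an illustration under our own assumptions: the right-skewed population is taken to be exponential with mean 1 and standard deviation 1, and the function name, seed and number of repetitions are arbitrary choices.

```python
import random
from statistics import mean, stdev

random.seed(1)  # fixed seed so the run is repeatable

def simulate_sample_means(n, repetitions=2000):
    """Draw `repetitions` samples of size n from a right-skewed
    (exponential) population with mean 1 and standard deviation 1,
    and return the list of sample means."""
    return [
        mean(random.expovariate(1.0) for _ in range(n))
        for _ in range(repetitions)
    ]

means_n30 = simulate_sample_means(30)

# The mean of the sample means should be close to mu = 1, and their
# standard deviation close to the standard error sigma / sqrt(n) = 1 / sqrt(30).
print("mean of sample means:", round(mean(means_n30), 3))
print("sd of sample means:  ", round(stdev(means_n30), 3))
```

Plotting a histogram of `means_n30` (for example in Excel) should show a roughly bell-shaped curve even though the population itself is skewed.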

5. Computing Probabilities for a Sample Mean

When the sampling distribution is normal, we are able to find the probability of
a sample mean falling within a specified range. We will need to find the
corresponding z values for values of 𝑥̅ in order to use Appendix 2 (Standard Normal Table).

The z value for a value of 𝑥̅ is calculated as

$$z = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}}$$

Example 8.6
(Large sample, non-normal population distribution)

The mean rent paid by all retail shops in a large city mall is $950 with a standard deviation of
$225. However, the population distribution of rents for all retail shops in this city mall is
skewed to the right. A sample of 100 shops was taken.
(a) Will the sampling distribution of 𝑥̅ be normal? Explain.
(b) Find the probability that the mean rent exceeds $990.

Solution (a):
Although the population distribution of rents paid by all retail shops is not normal,
the sample size of 100 is large (n ≥ 30). Hence, the Central Limit Theorem (CLT) can
be applied to infer the shape of the sampling distribution of 𝑥̅. By the CLT, the sampling
distribution will be approximately normal.

Solution (b):

$\mu_{\bar{x}} = \mu = 950$

$$\begin{aligned}
P(\bar{x} \ge 990) &= P\left(z \ge \frac{990 - 950}{225 / \sqrt{100}}\right) \\
&= P(z \ge 1.78) \\
&= 0.5 - 0.4625 \\
&= 0.0375
\end{aligned}$$
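
Probabilities for a sample mean can also be computed without the table. The sketch below uses `NormalDist` from Python's standard `statistics` module; the helper name `prob_sample_mean_between` is our own. Because the table method rounds z to two decimal places, a software answer may differ slightly in the fourth decimal place.

```python
from math import sqrt
from statistics import NormalDist

def prob_sample_mean_between(mu, sigma, n, lower=None, upper=None):
    """P(lower <= x-bar <= upper) when x-bar is (approximately) normal
    with mean mu and standard error sigma / sqrt(n).
    Leave lower or upper as None for a one-sided probability."""
    sampling_dist = NormalDist(mu, sigma / sqrt(n))
    p_upper = 1.0 if upper is None else sampling_dist.cdf(upper)
    p_lower = 0.0 if lower is None else sampling_dist.cdf(lower)
    return p_upper - p_lower

# Example 8.6: P(x-bar >= 990) with mu = 950, sigma = 225, n = 100
p = prob_sample_mean_between(950, 225, 100, lower=990)
print(round(p, 4))
```

Passing both `lower` and `upper` gives a two-sided probability, for instance `prob_sample_mean_between(39, 2, 25, lower=38.5, upper=40)`.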

Example 8.7
(Small sample, normal population distribution)

Upper primary school children have allowances that are approximately normally distributed
about a mean of $39 per week and a standard deviation of $2. A random sample of 25 children
is taken and the mean is calculated. What is the probability that this mean value will be between
$38.50 and $40?

Solution:
Although the sample size is small (n < 30), the shape of the sampling distribution of 𝑥̅ is
normal because the population is normally distributed.

$$\begin{aligned}
P(38.5 \le \bar{x} \le 40) &= P\left(\frac{38.5 - 39.0}{2 / \sqrt{25}} \le z \le \frac{40.0 - 39.0}{2 / \sqrt{25}}\right) \\
&= P(-1.25 \le z \le 2.50) \\
&= 0.3944 + 0.4938 \\
&= 0.8882
\end{aligned}$$

6. Discussion Questions

1. A variable x is normally distributed with a mean 50 and standard deviation of 4. Find


(a) P(x>55)
(b) P(44≤x≤55)
(c) P(52≤x≤55)

2. Assume the distribution of monthly food expenditures for a family of four follows the
normal distribution, with a mean of $490 and a standard deviation of $90.
(a) What is the probability that a selected family spends less than $430 on food?
(b) What is the probability that a selected family spends between $500 and $600 on
food?
(c) It is known that 10% of families spent below $X. Find the value of X.
(d) Is it likely for a selected family to spend more than $800 on food? Justify.

3. A normal population has a mean of 75 and a standard deviation of 5. You select a
sample of 40. Compute the probability that the sample mean is
(a) Less than 74
(b) Between 76 and 77

4. Assume that the number of weekly study hours for students at a certain university is
approximately normally distributed with a mean of 22 and a standard deviation of 6.
(a) Find the probability that a randomly chosen student studies less than 12 hours.
(b) A certain lecture group consists of 225 students. You may assume that this
group forms a simple random sample from the students in the university. Find
the probability that the average number of study hours is between 21 and 23
hours.

5. The transport claims made by marketing managers of Alliance Global follow a normal
distribution with a mean of $490 per month and standard deviation $80.
(a) What is the probability that a randomly selected manager has a transport claim
of more than $600?
(b) Suppose a sample of 49 managers was selected. What is the probability that the
mean transport claim is greater than $470?

7. Supplementary Questions

1. The average return achieved by people who invested in Real Estate Investment Trusts
or REITS is normally distributed with mean 9 percent and standard deviation 1.2
percent. Find the probability that a randomly selected investor achieved a return of
(a) more than 10 percent
(b) between 8 and 9.5 percent.

2. The life of a Model J7 electric shaver has a normal distribution with mean 65 months
and standard deviation of 6 months. The company is providing a warranty period such
that it does not replace more than 1% of the shavers. What is the warranty period
(months)?

3. YCH Logistics pays its part-time employees an average wage of $6.40 an hour with a
variance of $0.64. If the wages are approximately normally distributed,
(a) What percentage of the employees receive wages between $5.50 and $6.60?
(b) The bottom 20% of employees receive wages less than $X an hour. What is
the value of X?
(c) What is the probability that a sample of 36 employees will have a mean wage
of less than $6.10 an hour?

4. The annual commissions earned by sales representatives of AB Insurance follow the
normal distribution. The mean (µ) yearly amount earned is $40,000 and the standard
deviation (𝜎) is $5,000.
(a) What percent of the sales representatives earn more than $42,000 per year?
(b) What percent of the sales representatives earn between $32,000 and $42,000?
(c) What percent of the sales representatives earn between $32,000 and $35,000?
(d) The sales director wants to award the sales representatives who earn the largest
commissions a bonus of $1,000. He can award a bonus to 15% of the
representatives. What is the cutoff point between those who earn a bonus and
those who do not?

5. Assume that the weights of all packages for a certain brand of chocolate bar are normally
distributed with a mean of 32 grams and a standard deviation of 0.3 grams. Find the
probability that the mean weight of a random sample of 20 packages of this brand of
chocolate bar will be between 31.8 and 31.9 grams.

6. The time taken to learn a major sewing job for a new worker hired in the production
department of a garment factory is normally distributed with a mean of 80 hours and a
standard deviation of 6 hours. Find the probability that the mean time taken to learn
this job by a random sample of 16 new workers would be
(a) between 76 and 78 hours.
(b) within 4 hours of the population mean
(c) more than the population mean by at least 3.5 hours.

7. The amount of monthly car parking charges incurred by car drivers in a town has a
skewed distribution with a mean of $65 and a standard deviation of $25. Find the
probability that the mean amount of car parking charges for a random sample of 75
drivers selected from this town will be
(a) more than $70.
(b) between $58 and $63.
(c) less than the population mean by at least $5.

8. A Company, which manufactures dispensing machines for hot beverages, sets the fill
level at 197.5cc. The filling process gives a standard deviation of 5cc. The fill levels are
normally distributed.
(a) What is the probability that a randomly selected drink contains less than 190cc?
(b) What is the probability that a random sample of 50 drinks has a mean value
greater than 199cc?
(c) The company claims that an average drink is 200cc. What percentage of the
sample means are 200cc or more if samples of size 36 are taken?

9. The final scores of a management module follow the normal distribution. The mean of
the distribution is 74 and the standard deviation is 5. The professor wishes to award an
A grade only to the students whose scores belong to the highest 3%. Calculate the
dividing point for those students who earn an A grade and those who do not.

10. A machine is programmed to fill up bags of cement for industrial use. The amount filled
up per bag is normally distributed with mean 15 kg and a standard deviation of 1.5 kg.
(a) What percent of the bags contain between 15 and 16.4 kg?
(b) Bags that contain less than 13 kg are considered to be under-filled and will be
rejected for sale. What percentage of bags are under-filled?

BUSINESS STATISTICS

SESSION 9

ESTIMATION

At the end of the session, students should be able to:

1. explain the difference between a point estimate and an interval estimate.
2. use the normal distribution to construct a confidence interval for a population mean or
proportion.
3. use the t distribution to construct a confidence interval for a population mean.
4. decide whether the normal or t distribution should be used in constructing a confidence
interval for a population mean.
5. determine a sample size at a specified level of confidence and margin of error.

___________________________________________________________________________

1. Introduction

Statistical inference is the process of using sample results to draw conclusions about the
characteristics of a population. In this chapter, we shall examine a statistical procedure that
enables us to estimate the true population mean (µ) and population proportion (π). This
procedure is known as Estimation.

2. Point and Interval Estimates

There are two types of estimates namely a point estimate and an interval estimate that can be
used to estimate the true population parameter.

2.1 Point Estimate

If we select a sample and compute the value of a sample statistic for this sample, this
value gives the point estimate of the corresponding population parameter. For example, suppose
you take a sample of students from a university and find that the mean travelling time to the
university for this sample is 45 minutes. Then, using the sample mean (𝑥̅) as a point estimate
of the population mean (µ), we can say that the mean travelling time for all students is about 45
minutes.

Hence, we have the following commonly used point estimators to estimate population
parameters.

Point Estimator                     Parameter

Sample mean, 𝑥̅                     Population mean, µ
Sample standard deviation, s        Population standard deviation, 𝜎
Sample variance, s²                 Population variance, 𝜎²
Sample proportion, p                Population proportion, π

2.2 Interval Estimate

When we make a point estimate, we can never be sure that 𝑥̅, which is based on sample data, is
equal to µ. A point estimate alone is insufficient for making reliable inferences about the
population parameter. Hence, we should construct an interval estimate, which serves the purpose
of making inferences better.

In the case of an interval estimate, instead of using a single value, we use a range of values within
which the true value of the population parameter is likely to be included. For example, instead
of saying that the mean travelling time is 45 minutes, we could subtract and add 15 minutes to
45 minutes and then say that the mean travelling time ranges from 30 minutes to 60 minutes.
This is known as an interval estimate.

The question arises: what number should we subtract from and add to a point estimate in order
to obtain the interval estimate? The width of this estimate depends on two considerations:
• The standard error ($\sigma/\sqrt{n}$) and
• The required level of confidence, e.g. 95%

We always attach a probabilistic statement to the interval estimate. This statement is given by
the confidence level, for example, 95%. An interval that is constructed based on this confidence
level is called a confidence interval. Although any value of confidence level can be chosen,
the commonly used ones are 90%, 95% and 99%. In general, the level of confidence is
symbolized by (1 − α), where α is the proportion in the tails of the distribution that lies outside
the confidence interval. See Figure 9.1.

Figure 9.1 Interval Estimation

The meaning of confidence level e.g.95%

If we select all possible samples of size n, and if we calculate the confidence interval for each
of these samples, then 95% of such intervals will contain the true population mean.
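
This interpretation can be illustrated with a simulation. The sketch below is only an illustration: the population values (mean 45, standard deviation 15), the sample size and the seed are hypothetical, and the interval is built with the formula 𝑥̅ ± z𝜎/√n that is developed in the next section.

```python
import random
from math import sqrt
from statistics import mean

random.seed(2)  # fixed seed so the run is repeatable

MU, SIGMA, N = 45.0, 15.0, 36   # hypothetical population and sample size
Z_95 = 1.96                     # z value for 95% confidence

def interval_covers_mu():
    """Draw one sample, build the 95% interval, report whether it contains MU."""
    sample = [random.gauss(MU, SIGMA) for _ in range(N)]
    margin = Z_95 * SIGMA / sqrt(N)
    xbar = mean(sample)
    return xbar - margin <= MU <= xbar + margin

trials = 2000
coverage = sum(interval_covers_mu() for _ in range(trials)) / trials
print("proportion of intervals containing mu:", coverage)
```

The printed proportion should be close to 0.95. Any single interval either contains µ or does not; the 95% describes the long-run behaviour of the procedure.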

3. Interval Estimate of a Population Mean : Large Samples or 𝝈 known

A sample size is considered large when n is 30 or larger. When the sample size is large, we
will use the z distribution to construct a confidence interval for µ. The confidence interval for
µ is

$$\bar{x} \pm z\frac{\sigma}{\sqrt{n}}$$

$\frac{\sigma}{\sqrt{n}}$ is known as the standard error of the mean.

$z\frac{\sigma}{\sqrt{n}}$ is known as the margin of error, E (i.e. $E = z\frac{\sigma}{\sqrt{n}}$).

Note: The formula can also be written as $\bar{x} \pm z_{\alpha/2}\frac{\sigma}{\sqrt{n}}$. We will leave out the subscript
$\alpha/2$ in this coursebook.

In practice, we often do not know the value of the population standard deviation, 𝜎. Whenever
𝜎 is unknown, it can be estimated by the sample standard deviation, s. The confidence interval
for µ will then be

$$\bar{x} \pm z\frac{s}{\sqrt{n}}$$

The steps to obtain the z value for any given confidence level (for example 95%) are:
(a) Divide 0.95 by 2, which gives 0.4750.
(b) Refer to Appendix 2 to look for 0.4750 in the body of the table and then record the
corresponding z value. This value is 1.96 for 95% confidence. See Figure 9.2.

Figure 9.2 Finding z for a 95% confidence level
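
The table lookup above can be cross-checked in Python. This is a minimal sketch using `NormalDist` from the standard `statistics` module; the function name `z_for_confidence` is our own. Note that software returns more decimal places than the table, so for 99% confidence it gives a value close to 2.576, whereas this coursebook uses the table-based 2.575.

```python
from statistics import NormalDist

def z_for_confidence(confidence):
    """Two-sided z value: the point with area (1 - confidence)/2 in the upper tail."""
    alpha = 1 - confidence
    return NormalDist().inv_cdf(1 - alpha / 2)

for level in (0.90, 0.95, 0.99):
    print(level, round(z_for_confidence(level), 3))
```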

Example 9.1
The standard deviation of the length of J8 stainless steel bolts produced by a machine is known
to be 4.5 mm. For a simple random sample of 36 bolts, the average length is 48.4 mm. What
is the 90% confidence interval for the mean length of bolts produced by the machine?

Solution:
n = 36, 𝑥̅ = 48.4, 𝜎 = 4.5

90% confidence interval:

$$\begin{aligned}
\bar{x} \pm z\frac{\sigma}{\sqrt{n}} &= 48.4 \pm 1.645 \times \frac{4.5}{\sqrt{36}} \\
&= 48.4 \pm 1.23 \\
&= (47.17;\ 49.63) \text{ mm}
\end{aligned}$$

Note: For a 90% confidence level, we have to locate 0.45 in Appendix 2. The numbers closest
to 0.45 are 0.4495 and 0.4505, which give z = 1.64 and z = 1.65 respectively. The average,
1.645, is normally used, although 1.64 and 1.65 are also acceptable.

Thus, we are 90% confident that the mean length is between 47.17 and 49.63 mm.
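
The interval calculation can be wrapped in a small helper so that the arithmetic is easy to repeat. The sketch below is an illustration only; the function name is our own, and the z value is supplied by the user from Appendix 2.

```python
from math import sqrt

def z_confidence_interval(xbar, sd, n, z):
    """Return (lower, upper) for x-bar ± z * sd / sqrt(n).
    sd is sigma when it is known, or s for a large sample."""
    margin = z * sd / sqrt(n)
    return xbar - margin, xbar + margin

# Example 9.1: n = 36, x-bar = 48.4, sigma = 4.5, z = 1.645 (90% confidence)
lower, upper = z_confidence_interval(48.4, 4.5, 36, 1.645)
print(round(lower, 2), round(upper, 2))
```

The same helper applies when 𝜎 is unknown and the sample is large: pass the sample standard deviation s in place of 𝜎.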

Example 9.2
For a sample of 35 renovation jobs, Albert found that the mean number of days taken to
complete a renovation job for a new HDB flat is 42 days with a sample standard deviation of 8
days. Construct a 95% confidence interval for the mean number of days taken to complete a
renovation job.

Solution:
n = 35, 𝑥̅ = 42, s = 8

95% confidence interval:

$$\begin{aligned}
\bar{x} \pm z\frac{s}{\sqrt{n}} &= 42 \pm 1.96 \times \frac{8}{\sqrt{35}} \\
&= 42 \pm 2.65 \\
&= (39.35;\ 44.65) \text{ days}
\end{aligned}$$

Therefore, the 95% confidence interval for the mean number of days taken to complete a renovation
job is from 39.35 to 44.65 days.

4. Interval Estimate of a Population Mean: Small samples and unknown 𝝈

When the sample taken is small (n < 30) and the population standard deviation, 𝜎, is unknown,
the normal distribution is replaced by the t distribution to construct confidence intervals about
µ. We use the sample standard deviation, s, as a point estimate of 𝜎.

So, the conditions for use of the t distribution are:

- Sample size n < 30 AND
- Population standard deviation, 𝜎, is unknown.
Also, we have to assume that the population from which the sample was drawn is approximately
normally distributed.
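
The choice between the z and t distributions, as stated in this session, can be written as a simple rule. The sketch below is only a restatement of that rule in Python; the function name is our own, and it does not check the normality assumption.

```python
def distribution_for_mean(n, sigma_known):
    """Return "t" when the sample is small (n < 30) and sigma is unknown,
    otherwise "z", following the rule stated in this session."""
    if n < 30 and not sigma_known:
        return "t"
    return "z"

print(distribution_for_mean(18, sigma_known=False))  # small sample, sigma unknown
print(distribution_for_mean(35, sigma_known=False))  # large sample
print(distribution_for_mean(20, sigma_known=True))   # sigma known
```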

The characteristics of the t-distribution are:


• It is symmetric about the mean like the normal distribution.
• It is bell-shaped – but ‘flatter’ and more ‘spread out’ than normal distribution. That is,
lower at the mean, heavier at the tail compared with normal distribution.
• As the sample size increases, t distribution converges to normal distribution.
• It is a family of distribution specified by ‘degrees of freedom’ (d.f. = n–1). See Figure 9.3.

Figure 9.3 The t distribution for various degrees of freedom

The confidence interval for µ is

$$\bar{x} \pm t\frac{s}{\sqrt{n}}$$

The value of t is obtained from the t distribution table (Appendix 3) for n-1 degrees of freedom
and the given confidence level.

Example 9.3
Find the t-value for a sample size of 18 and a 95% confidence level.

Solution:
When n=18, df = n-1 =17.
When the confidence level is 95%, the area in the two tails combined (also known as α) is equal to
5% or 0.05. Alternatively, if you look at each tail separately, the area is α/2 or 0.025
(one tail).

Referring to Appendix 3, the corresponding t-value is 2.110. See Figure 9.4

Figure 9.4 Determining t value
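
The t value in Appendix 3 can be reproduced numerically. Statistical packages provide this directly (for example `scipy.stats.t.ppf` in SciPy); the sketch below avoids external packages by integrating the t density with Simpson's rule and searching for the critical value by bisection. The function names and step counts are our own choices, and the sketch is meant for the small degrees of freedom used in this session.

```python
from math import gamma, pi, sqrt

def t_pdf(x, df):
    """Density of the t distribution with df degrees of freedom."""
    coef = gamma((df + 1) / 2) / (sqrt(df * pi) * gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)

def t_cdf(x, df, steps=2000):
    """P(T <= x) for x >= 0: Simpson's rule on [0, x] plus the left half (0.5)."""
    h = x / steps
    total = t_pdf(0, df) + t_pdf(x, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 + total * h / 3

def t_critical(confidence, n, lo=0.0, hi=100.0):
    """Two-sided t value for the given confidence level and sample size (bisection)."""
    df = n - 1
    target = 1 - (1 - confidence) / 2      # area to the left of the critical value
    for _ in range(60):
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

# Example 9.3: sample size 18, 95% confidence
print(round(t_critical(0.95, 18), 3))
```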

Example 9.4
An accounting firm would like to set up a guideline for the time required to complete a certain
type of audit operation. A sample of auditing times from 18 different junior auditors was
obtained with a mean time of 3.2 hours and a standard deviation of 1.6 hours. Determine a 95%
confidence interval for the average time required in completing such type of auditing.
(a) Explain why the t-distribution should be used.
(b) Find the t-value for a 95% confidence level.
(c) Construct a 95% confidence interval.

Solution (a):
The two conditions for using the t-distribution are met: the sample is small (n = 18 < 30) and
𝜎 is unknown (only the sample standard deviation is given).

Solution (b):
n=18 𝑥̅ = 3.2 𝑠 = 1.6
df = n – 1 = 18 -1 =17
t-value = 2.110 (from Appendix 3)

Solution (c):
95% confidence interval:

$$\begin{aligned}
\bar{x} \pm t\frac{s}{\sqrt{n}} &= 3.2 \pm 2.110 \times \frac{1.6}{\sqrt{18}} \\
&= 3.2 \pm 0.796 \\
&= (2.404;\ 3.996) \text{ hours}
\end{aligned}$$

Therefore, the 95% confidence interval for the mean time required in completing such type of
auditing is from 2.404 hours to 3.996 hours.

5. Interval Estimate of a Population Proportion

We may often want to estimate the population proportion or percentage. For example, a
company may want to know the proportion of defective items received in a shipment. A hotel
may want to find the percentage of hotel guests who are satisfied with the service of the hotel.

The population proportion is denoted by π and the sample proportion by p. The confidence
interval for the population proportion is

$$p \pm z\sqrt{\frac{p(1 - p)}{n}}$$

The z value is obtained from the standard normal table (Appendix 2) for a given confidence
level. This value is located in the same way that was done for large sample estimates of µ.
$\sqrt{\frac{p(1 - p)}{n}}$ is the estimated standard error for proportions.

Example 9.5
A food company found that in a sample of 100 purchase orders, 10 contain errors. Find the
95% confidence interval of the population proportion of purchase orders that contain errors.

Solution:
n = 100, sample proportion $p = \frac{x}{n} = \frac{10}{100} = 0.1$
z = 1.96 (look up 0.4750 in the body of the standard normal table, Appendix 2)

95% confidence interval:

$$\begin{aligned}
p \pm z\sqrt{\frac{p(1 - p)}{n}} &= 0.1 \pm 1.96\sqrt{\frac{0.1(1 - 0.1)}{100}} \\
&= 0.1 \pm 0.0588 \\
&= (0.041;\ 0.159)
\end{aligned}$$

Therefore, the 95% confidence interval for the proportion of purchase orders with errors is
from 4.1% to 15.9%.
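
The proportion interval can be computed in the same way as the interval for a mean. The following Python sketch is an illustration; the function name is our own and the z value is taken from Appendix 2.

```python
from math import sqrt

def proportion_confidence_interval(successes, n, z):
    """Return (p, lower, upper) for p ± z * sqrt(p(1 - p)/n)."""
    p = successes / n
    margin = z * sqrt(p * (1 - p) / n)
    return p, p - margin, p + margin

# Example 9.5: 10 of 100 purchase orders contain errors, z = 1.96 (95% confidence)
p, lower, upper = proportion_confidence_interval(10, 100, 1.96)
print(round(p, 3), round(lower, 3), round(upper, 3))
```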

6. Factors Affecting the Width of a Confidence Interval

The 3 factors that affect the width of a confidence interval are:

(a) Level of Confidence
The higher the level of confidence, the wider the Confidence Interval. A higher confidence
level has a larger z value, resulting in a greater margin of error. Hence, the interval is wider.

(b) Sample Size
The larger the sample size, the narrower the Confidence Interval. A larger n will lead to a
smaller standard error ($\sigma/\sqrt{n}$), resulting in a narrower or more precise interval.

(c) Variability of Population
The larger the population variability (measured by 𝜎 or 𝜎²), the wider the Confidence Interval.
A larger 𝜎 will increase the standard error, leading to a wider or less precise interval.
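
The three effects can be seen by changing one input at a time in the margin of error, E = z𝜎/√n. The numbers in the sketch below are hypothetical and chosen only for illustration.

```python
from math import sqrt

def margin_of_error(z, sigma, n):
    """E = z * sigma / sqrt(n); the interval width is 2E."""
    return z * sigma / sqrt(n)

base = margin_of_error(z=1.96, sigma=10, n=100)            # hypothetical baseline
higher_conf = margin_of_error(z=2.575, sigma=10, n=100)    # 99% instead of 95%
larger_n = margin_of_error(z=1.96, sigma=10, n=400)        # four times the sample size
larger_sigma = margin_of_error(z=1.96, sigma=20, n=100)    # more variable population

print(base, higher_conf, larger_n, larger_sigma)
```

Note that quadrupling the sample size only halves the margin of error, because n enters the formula through its square root.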

7. Determining Sample Size

Because resources are limited, research almost always relies on samples rather than a census. If
we know the confidence level and the width of the confidence interval that we want, then we
will be able to find the approximate sample size that will produce the required result.

The following formulas will help us determine the required sample size, n.

Sample size for the estimation of µ:

$$n = \left(\frac{z\sigma}{E}\right)^2$$

Sample size for the estimation of π:

$$n = p(1 - p)\left(\frac{z}{E}\right)^2$$

where E represents the margin of error or maximum tolerance limit.


If 𝜎 is unknown, then the sample standard deviation, s, is used.
If p is unknown, we will use p = 0.5.

Note that the final answer for the sample size should be rounded UP to a whole number. This
is always the case when determining sample size to ensure that the conditions of confidence
level and margin of error are met.
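
Both sample size formulas, including the rounding-up step, can be sketched in Python as follows. The function names are our own; the inputs shown are those of Examples 9.6 and 9.7, which follow.

```python
from math import ceil

def sample_size_for_mean(z, sigma, E):
    """n = (z * sigma / E)^2, rounded up to the next whole number."""
    return ceil((z * sigma / E) ** 2)

def sample_size_for_proportion(z, E, p=0.5):
    """n = p(1 - p)(z / E)^2, rounded up; p = 0.5 when no estimate is available."""
    return ceil(p * (1 - p) * (z / E) ** 2)

print(sample_size_for_mean(z=1.96, sigma=30000, E=5000))
print(sample_size_for_proportion(z=2.575, E=0.03))
```

Using `ceil` rather than ordinary rounding reflects the rule that the sample size is always rounded up.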

Example 9.6
A researcher wishes to know the mean annual income of human resource managers in the
manufacturing industry. How large a sample is required if he wants to be 95% confident that
the estimate is within $5000 of the true population mean annual income? Assume the
population standard deviation is $30,000.

Solution:
95% confidence level → z = 1.96
Margin of error, E = 5000
𝜎 = 30,000

$$n = \left(\frac{z\sigma}{E}\right)^2 = \left(\frac{1.96 \times 30000}{5000}\right)^2 = 138.3$$

Hence, minimum sample size is 139 (round up).

Example 9.7
Lim Electronics has just installed a new machine that makes a part that is used in car autolocks.
The company wants to estimate the proportion of these parts produced by the machine that are
defective. The manager wants this estimate to be within 0.03 of the population proportion for
a 99% confidence level. What is the minimum sample size required?

Solution:
99% confidence level → z = 2.575
Margin of error, E = 0.03

Since we have no prior information about the value of p, we shall use p = 0.5.

$$n = p(1 - p)\left(\frac{z}{E}\right)^2 = (0.5)(0.5)\left(\frac{2.575}{0.03}\right)^2 = 1841.8$$

Hence, minimum sample size is 1842 (round up).

8. Discussion Questions

1. A recent survey of 50 executives who were laid off from their previous position revealed
it took a mean of 26 weeks for them to find another position. The standard deviation is
known to be 6.2 weeks.
(a) What is the point estimate of the population mean, µ?
(b) Construct a 95% confidence interval for the population mean.
(c) A manpower ministry official says that the mean duration taken to find a job
after being laid off is 20 weeks. Is this estimate reasonable?

2. An insurance company reported that it paid out many claims last year for car accidents.
In a sample of 64 claims made this year, the mean claim amount was found to be $7,300
with a sample standard deviation of $1,200.
(a) Construct the 98% confidence interval for the population mean claim amount
for car accidents.
(b) Suppose a mistake was made in the computation of the standard deviation. The
value of the standard deviation is supposed to be larger. How would this affect
the width of the interval?

3. Furniture Land surveyed 600 consumers and found that 414 were enthusiastic about a
new décor plan they plan to show in a major home exhibition. Construct the 99%
confidence interval for the population proportion.

4. The Health Promotion Board wants to estimate the mean yearly milk consumption. A
sample of 16 people reveals the mean yearly consumption to be 60 litres with a standard
deviation of 20 litres.
(a) Explain why we need to use the t distribution. What assumption do you need to
make?
(b) For a 90% confidence interval, what is the value of t?
(c) Develop the 90% confidence interval for the population mean.

5. Family Health, a publisher of health magazine wants to determine the mean insurance
premium paid by its subscribers. If the population standard deviation is $1,000, what
sample size is needed if the firm wants to be 99% confident of being correct to within
± $250?

6. A survey of 20 teachers found that the mean age of the teachers is 40.6 years old, with
a sample standard deviation of 9.5 years.
(a) Find the 99% confidence interval of the population mean.
(b) An additional 20 teachers were surveyed. The mean age remained at 40.6 years
and the sample standard deviation remained at 9.5 years. 30 out of the 40
teachers have more than 10 years of teaching experience.

(i) Find the 95% confidence interval for the population mean age. (Express
answers to 1 decimal place)
(ii) Find the 95% confidence interval for the proportion of teachers with
more than 10 years teaching experience. (Express answers to 3 decimal
places)

7. BBS Bank has been providing incentives for consumers to use mobile banking more
extensively. Susan collected some data from a random sample of 100 customers and
found that the average number of mobile banking transactions each month for these
customers was 8.8 transactions with a standard deviation of 2.8 transactions.

(a) Find the 95% confidence interval for the population mean number of mobile
banking transactions per month.

(b) BBS Bank has set a goal of achieving a mean number of 10 mobile banking
transactions per month. Based on the result in part (a), has the bank achieved
its target? Explain.

(c) What is the minimum sample size required if the bank wants to estimate the
mean number of mobile banking transactions to within 0.6 transactions with
98% confidence?

9. Supplementary Questions

1. The average time in days required to deliver orders by an electrical company is to be
estimated. A sample of 60 orders is selected randomly from recent trading. The sample
mean is 5.9 days and the sample standard deviation is 1.7 days. Calculate a confidence
interval for the mean delivery time at
(a) the 95% level
(b) the 90% level.

2. A random sample of 150 people had a mean weight of 71.2kg with a standard deviation
of 4.9kg. Construct a 90% confidence interval for the mean weight of the population
from which this sample was taken.

Singaporeans have a mean weight of 68.7kg. Is it likely that this sample was
a sample of Singaporeans? Explain your answer.

3. A Company is considering introducing a new scheme of shift work. They would like to
know whether the scheme is favourable to the majority of workers before they introduce
it. A random sample of 73 workers showed 43 in favour.
Construct a 95% confidence interval and advise the company how they should act.

4. A large company is looking at the time taken by workers to complete a particular job in
a plant. A sample of 41 workers showed a mean time of 34.3 minutes with a standard
deviation of 2.5 minutes. Give a 97% confidence interval for the mean time taken by
workers to complete the job.

5. The standard crop yield for a certain kind of vegetable averages 76 kg per square metre
of land plot. A new fertilizer is applied to a sample of 5 separate one-square metre plots.
The crop yields recorded are:
83, 81, 87, 79, 77
(a) Compute the sample mean and sample standard deviation.
(b) Construct a 95% confidence interval and comment on the results. Does the
fertilizer seem to be making a difference to crop yield? Assume that crop
yields are normally distributed.

6. A recent survey involving a random sample of 25 students in a private school showed
that the students utilize the library on average 1.8 times per week with a standard
deviation of 0.4.
(a) Explain why the t-distribution should be used to compute the confidence
interval? What assumption must you make?
(b) Construct a 95% confidence interval for the mean number of times that a student
will visit the library per week.

7. When a sample of 70 retail managers were surveyed regarding the poor performance of
the retail industry in the recent quarter, 65% believed decreased sales were due to a
recent increase in goods and services tax (GST).

Find the 95% confidence interval for the proportion of retail managers who believed
that decreased sales were due to increase in GST.

8. The standard deviation for a population is 𝜎 =16.4. A sample of 100 observations
selected from this population gave a mean equal to 143.72.
(a) Construct a 90% confidence interval for µ.
(b) Construct a 95% confidence interval for µ.
(c) Construct a 99% confidence interval for µ.
(d) Does the width in parts (a) to (c) increase as confidence level increases?
Explain.

9. The standard deviation for a population is 𝜎 = 6.30. A sample of 100 observations
selected from this population gave a mean equal to 78.90.
(a) Construct a 99% confidence interval for µ assuming n=36.
(b) Construct a 99% confidence interval for µ assuming n=81.
(c) Construct a 99% confidence interval for µ assuming n=100.
(d) Does the width in parts (a) to (c) decrease as sample size increases? Explain.

10. As an executive of the Consumer’s Association, you took a random sample of 10 cans
of baked beans at a canning plant. The net weights of the beans (in ounces) are reported
in the table below.
16.2 16.1 15.6 15.8 16.2
16.1 15.9 16.0 15.7 15.9

(a) Determine the mean weight of beans in these cans.


(b) Given that the sample standard deviation is 0.0428 ounces and assuming that the
weights per can are normally distributed, construct a 99% confidence interval
for the mean weight per can of beans.
(c) State 2 ways in which the width of the interval computed in (b) may be decreased.

BUSINESS STATISTICS

SESSION 10

ONE-SAMPLE HYPOTHESIS TESTING

At the end of the session, students should be able to:

1. transform problems into appropriate null and alternative hypotheses.
2. carry out a hypothesis test on a population parameter (mean & proportion).
3. understand level of significance & Type I and Type II errors.
4. understand how confidence interval relates to hypothesis testing.
___________________________________________________________________________

1. Introduction

In a test of hypothesis, we test a certain given belief about a population parameter. If someone
makes a claim about the general population value, how can we substantiate that?

Hypothesis testing allows us to evaluate the situation using sample information and then
conclude whether the claim can be retained or has to be rejected. A sample value will usually
differ from the claimed population value, so the task is to judge whether the “observed difference”
between a sample statistic and the hypothesized value of the population parameter is
statistically significant.

2. Key Terms in Hypothesis Testing

2.1 Null Hypothesis and Alternative Hypothesis

The null hypothesis (Ho) is a claim (or statement) about a population parameter that is assumed
to be true. The alternative hypothesis (H1) is the opposite of Ho. It is a claim that will be true
if the null hypothesis is false.

Assuming we are testing a claim about the population mean (µ), there are three possible ways
of formulating Ho and H1.

Left-tailed test:     Ho : µ = µ0     e.g. Ho : µ = 10
                      H1 : µ < µ0          H1 : µ < 10

Right-tailed test:    Ho : µ = µ0     e.g. Ho : µ = 10
                      H1 : µ > µ0          H1 : µ > 10

Two-tailed test:      Ho : µ = µ0     e.g. Ho : µ = 10
                      H1 : µ ≠ µ0          H1 : µ ≠ 10

where µ denotes the population mean and
µ0 denotes the hypothesized population mean value.
Note: H1 carries the burden of proof and the “=” sign always appears in Ho.

2.2 Significance Level

In doing hypothesis testing, a significance level (α) is set. α is the probability of rejecting
Ho when it is true; it is represented by an area under the distribution curve. This is further
explained under Part 4 (Types of Errors).

2.3 Test statistic

This is either a “z” or “t” value calculated based on sample information. It is a quantity used
in deciding whether or not to reject the Ho.

The formulas to calculate the test statistic are as follows:

Testing a mean when n ≥ 30 or 𝜎 known:

$$z = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}}$$

Testing a mean when n < 30 and 𝜎 unknown:

$$t = \frac{\bar{x} - \mu}{s / \sqrt{n}}$$

Testing a proportion:

$$z = \frac{p - \pi}{\sqrt{\dfrac{\pi(1 - \pi)}{n}}}$$
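
The test statistics above translate directly into code. In the sketch below the function names are our own, `mu0` and `pi0` stand for the hypothesized values of µ and π, and the inputs are hypothetical numbers chosen only to exercise the functions.

```python
from math import sqrt

def mean_test_statistic(xbar, mu0, sd, n):
    """(x-bar - mu0) / (sd / sqrt(n)).
    Read the result as z when n >= 30 or sigma is known (sd = sigma or s),
    and as t with n - 1 degrees of freedom when n < 30 and sigma is unknown (sd = s)."""
    return (xbar - mu0) / (sd / sqrt(n))

def proportion_test_statistic(p, pi0, n):
    """z = (p - pi0) / sqrt(pi0 (1 - pi0) / n)."""
    return (p - pi0) / sqrt(pi0 * (1 - pi0) / n)

# Hypothetical inputs
print(mean_test_statistic(52, 50, 4, 16))         # small sample: a t value
print(proportion_test_statistic(0.6, 0.5, 100))   # a z value
```

Note that the standard error for a proportion test uses the hypothesized π, not the sample proportion p.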

2.4 Rejection and Non-Rejection Regions

The size of the rejection region depends on the value of the significance level (α). Although
any value can be assigned to α, the commonly used values of α are 0.01, 0.05 and 0.10. We
may have one or two rejection regions depending on whether it is a left-tailed, right-tailed or
two-tailed test (refer to Figure 10.1). The region outside the shaded region α is the
non-rejection region.
Left-tailed test Right-tailed test Two-tailed test

Figure 10.1 Rejection Regions

The rejection region is also known as the critical region. For a two-tailed test, the area α is
split between the two tails. The rejection region in each tail is α/2.

2.5 Critical Value(s)

The critical value is a z value or t value corresponding to the specified α. It serves as the
boundary or cut-off point between the rejection and non-rejection regions. See Figure 10.2.

Figure 10.2 Critical values

3. Steps in Hypothesis Testing

There are five basic steps to follow when performing a hypothesis test:

Step 1: Formulate the null and alternative hypotheses.

Step 2: Specify the level of significance.

Step 3: Calculate the value of the test statistic (z or t).

Step 4: Determine the critical value and form the decision rule.

Step 5: Draw conclusion.

We shall look at various situations to see how these steps are carried out.

3.1 Hypothesis Test about a Population Mean: Large Sample or 𝝈 known

Based on the Central Limit Theorem, the sampling distribution of 𝑥̅ is approximately normal
for large samples (n ≥ 30). Hence, whether 𝜎 is known or unknown, the normal distribution
is used to test the hypothesis about a population mean whenever we have a large sample.
When 𝜎 is unknown, the sample standard deviation s is used in its place.

Example 10.1
A polyclinic uses a certain drug with a mean packaged dose of 100 cm³. The standard deviation
is known to be 3 cm³. A random sample of 36 doses is selected and the mean dosage is found
to be 101 cm³. Test at the 0.01 significance level whether the mean dosage in the packages is
larger than 100 cm³.

146
Solution:
Step 1   Ho : µ ≤ 100
         H1 : µ > 100   [This is a right-tailed test]

Step 2   α = 0.01, n = 36, 𝜎 = 3, 𝑥̅ = 101 → Use z since n is large.

Step 3   Test statistic:

\[ z = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}} = \frac{101 - 100}{3 / \sqrt{36}} = 2.0 \]

Step 4   Critical value, z = 2.33
         [Look for area = 0.49 in Appendix 2 to obtain the critical z value]
         Decision rule: Reject Ho if test statistic > 2.33

Step 5   Since the test statistic falls outside the rejection region, we do not reject H0. There is
         insufficient evidence to conclude that the mean dosage is larger than 100 cm³ at
         the α = 0.01 level of significance.
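The working for Example 10.1 can be reproduced with a short script. This is a sketch using Python's standard library rather than the Appendix 2 table, so the critical value is computed instead of looked up; the variable names are illustrative.

```python
import math
from statistics import NormalDist

# Figures from Example 10.1
mu0, sigma, n, x_bar, alpha = 100, 3, 36, 101, 0.01

z = (x_bar - mu0) / (sigma / math.sqrt(n))   # Step 3: test statistic
z_crit = NormalDist().inv_cdf(1 - alpha)     # Step 4: right-tailed critical value
reject_h0 = z > z_crit                       # decision rule for a right-tailed test

print(f"z = {z:.2f}, critical z = {z_crit:.2f}, reject H0: {reject_h0}")
# prints: z = 2.00, critical z = 2.33, reject H0: False
```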

Example 10.2
A soft drink manufacturer claims that each soft drink can has a volume of 150 ml. A sample
of 40 soft drink cans was randomly selected for a quality control check and the sample mean
was found to be 130 ml with a standard deviation of 60 ml. Test at the 0.01 level of significance
whether the average volume of soft drink in the cans differs from 150 ml.

Solution:
Step 1   Ho : µ = 150
         H1 : µ ≠ 150   [This is a two-tailed test]

Step 2   α = 0.01, n = 40, s = 60, 𝑥̅ = 130 → Use z since n is large.

Step 3   Test statistic:

\[ z = \frac{\bar{x} - \mu}{s / \sqrt{n}} = \frac{130 - 150}{60 / \sqrt{40}} = -2.11 \]

Step 4   Critical value, z = ± 2.575
         [Look for area = 0.4950 in Appendix 2 to obtain the critical z value]
         Decision rule: Reject Ho if test statistic > 2.575 or test statistic < -2.575

Step 5   Since the test statistic falls outside the rejection region, we do not reject H0. There
         is insufficient evidence to conclude that the mean volume of soft drink in the
         cans differs from 150 ml at the α = 0.01 level of significance.
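For a two-tailed test such as Example 10.2, the decision rule compares the absolute value of the test statistic with the positive critical value. A minimal sketch, again computing the critical value from the standard library instead of reading it from Appendix 2 (so it comes out near 2.576 rather than the tabulated 2.575):

```python
import math
from statistics import NormalDist

# Figures from Example 10.2
mu0, s, n, x_bar, alpha = 150, 60, 40, 130, 0.01

z = (x_bar - mu0) / (s / math.sqrt(n))          # large sample, s stands in for sigma
z_crit = NormalDist().inv_cdf(1 - alpha / 2)    # alpha/2 in each tail
reject_h0 = abs(z) > z_crit                     # reject if z < -z_crit or z > z_crit

print(f"z = {z:.2f}, critical z = ±{z_crit:.3f}, reject H0: {reject_h0}")
```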

3.2 Hypothesis Test about a Population Mean : Small Sample and 𝝈 unknown

When a population is approximately normally distributed but the population standard deviation
𝜎 is unknown and the sample size is small (n<30), the normal distribution is replaced by the t
distribution to make a hypothesis test about µ.

The steps to conduct the test in the case of small samples are similar to those for large samples.
The only difference is the use of the t distribution in place of the normal z distribution.

Example 10.3
A new manager at JE Country Club has been told by his predecessor that the club’s members
have an average length of membership of 8.7 years. The manager took a random sample of 15
membership files, and he found the mean length of membership to be 7.2 years with a standard
deviation of 2.5 years. Assume the length of the membership in this club is normally distributed.
At a 0.05 level of significance, does this sample result suggest that the actual mean length of
membership in this club may be less than 8.7 years?

Solution:
Step 1   Ho : µ ≥ 8.7
         H1 : µ < 8.7   [This is a left-tailed test]

Step 2   α = 0.05, n = 15, 𝑥̅ = 7.2, s = 2.5

Step 3   We use the t-distribution since n < 30 and 𝜎 is unknown.

         Test statistic:

\[ t = \frac{\bar{x} - \mu}{s / \sqrt{n}} = \frac{7.2 - 8.7}{2.5 / \sqrt{15}} = -2.32 \]

Step 4   df = n - 1
            = 15 - 1 = 14
         Critical t = -1.761
         [Look for df = 14, column 0.05 (one-tail) in Appendix 3 to obtain the critical t value]
         Decision rule: Reject Ho if test statistic < -1.761

Step 5   Since test statistic < -1.761, we reject Ho. The data seem to suggest that the
         average membership length at this club is less than 8.7 years at α = 0.05.
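Example 10.3 can be checked in the same way. Python's standard library has no t distribution, so this sketch simply takes the critical value -1.761 from Appendix 3 as given in the solution; only the test statistic and the decision are computed.

```python
import math

# Figures from Example 10.3
mu0, n, x_bar, s = 8.7, 15, 7.2, 2.5

t = (x_bar - mu0) / (s / math.sqrt(n))   # Step 3: test statistic
df = n - 1                               # Step 4: degrees of freedom
t_crit = -1.761                          # Appendix 3, df = 14, 0.05 one-tail (left tail, so negative)
reject_h0 = t < t_crit                   # decision rule for a left-tailed test

print(f"t = {t:.2f}, df = {df}, reject H0: {reject_h0}")
# prints: t = -2.32, df = 14, reject H0: True
```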

3.3 Hypothesis Test about a Population Proportion

We may sometimes want to conduct a test of hypothesis about a population proportion. For
example, a company may claim that 99% of their orders are shipped on time. The quality
control department may want to check from time to time whether this claim is true.

The testing procedure involves the same five steps.

Example 10.4
300 consumers were sampled and it was found that 37% used Brand A toothpaste. A similar
study conducted 5 years ago showed that 32% of the consumers used Brand A toothpaste. At a
10% significance level, is there evidence that there is a change in the proportion of consumers
using Brand A toothpaste?

Solution:
Step 1   Ho : π = 0.32
         H1 : π ≠ 0.32   [This is a two-tailed test]

Step 2   α = 0.1, n = 300, p = 0.37 → Use z

Step 3   Test statistic:

\[ z = \frac{p - \pi}{\sqrt{\pi (1 - \pi) / n}} = \frac{0.37 - 0.32}{\sqrt{0.32 (1 - 0.32) / 300}} = 1.86 \]

Step 4   Critical value, z = ± 1.645
         [Look for area = 0.45 in Appendix 2 to obtain the critical z value]
         Decision rule: Reject Ho if test statistic > 1.645 or test statistic < -1.645

Step 5   Since test statistic > 1.645, we reject Ho. There is evidence that the proportion
         of consumers using Brand A toothpaste has changed at the α = 0.1 level of
         significance.
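A short sketch of Example 10.4 follows. Note that the standard error uses the hypothesised proportion π = 0.32, not the sample proportion. The critical value is computed with the standard library rather than read from Appendix 2.

```python
import math
from statistics import NormalDist

# Figures from Example 10.4
pi0, n, p, alpha = 0.32, 300, 0.37, 0.10

std_error = math.sqrt(pi0 * (1 - pi0) / n)      # based on the hypothesised proportion
z = (p - pi0) / std_error                       # Step 3: test statistic
z_crit = NormalDist().inv_cdf(1 - alpha / 2)    # Step 4: two-tailed critical value
reject_h0 = abs(z) > z_crit

print(f"z = {z:.2f}, critical z = ±{z_crit:.3f}, reject H0: {reject_h0}")
# prints: z = 1.86, critical z = ±1.645, reject H0: True
```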

4. Types of Errors in Testing

When we make a decision, it would be nice if it were always a correct decision. This, however,
is statistically impossible, since we are making decisions on the basis of sample information.
The best we can hope for is to control the risk, or the chance with which an error occurs.

There are four possible outcomes in hypothesis testing that could be reached as shown in
Table 10.1.

                                   Actual Situation
Decision              H0 is true              H0 is false
In favor of H0        Correct Decision        Type II error (β)
Reject H0             Type I error (α)        Correct Decision
Table 10.1 Types of Errors

A correct decision occurs when Ho is true and we do not reject Ho. A correct decision also
occurs when Ho is false and our decision is to reject Ho.

A Type I error occurs when a true null hypothesis is rejected, that is, when the null hypothesis
is true but we decided against it. The probability assigned to Type I error is α, which is known
as the significance level of the test.

A Type II error occurs when we fail to reject a Ho that is actually false. The probability
assigned to Type II error is β. We can calculate the probability of a Type II error if and only
if the Ho is false and we know the true (actual) population value. The computations will not be
dealt with here.

For a fixed sample size, there is an inverse relationship between α and β: lowering α raises β.
Hence, a lower α is not necessarily better.
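Although the computation of β is outside the scope of this session, the inverse relationship can be illustrated numerically. The sketch below reuses the set-up of Example 10.1 (µ = 100 under Ho, 𝜎 = 3, n = 36, right-tailed test) and assumes, purely for illustration, that the true mean is 101.5. That true mean and the function name are hypothetical and are not taken from the module.

```python
import math
from statistics import NormalDist

def type_ii_error(alpha, mu0, mu_true, sigma, n):
    """Probability of not rejecting H0: mu <= mu0 (right-tailed z test)
    when the population mean is actually mu_true."""
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha)
    # Under the true mean, the z statistic is centred at this shift instead of 0.
    shift = (mu_true - mu0) / (sigma / math.sqrt(n))
    return std_normal.cdf(z_crit - shift)

# Hypothetical true mean of 101.5, used only to show how beta moves with alpha.
betas = {a: type_ii_error(a, mu0=100, mu_true=101.5, sigma=3, n=36)
         for a in (0.01, 0.05, 0.10)}
for a, b in betas.items():
    print(f"alpha = {a:.2f} -> beta = {b:.3f}")
```

Running the sketch shows β falling as α is raised, which is the trade-off described above.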

5. Hypothesis Testing using the p-value approach

In our previous examples, the value of the significance level α is given. We then used the
sample test statistic to compare to the critical value for the given α to draw our conclusion.

In the probability-value approach, more commonly known as the p-value approach, the sample
test statistic will not be compared with the critical value. We will instead find the probability
of getting a sample statistic value as extreme as, or more extreme than, the sample statistic
value under study. Hence, the p-value may be defined as the smallest significance level (α) at
which the Ho is rejected.

The p-value will be compared with α to make a decision about the hypothesis.

p-value > level of significance (α) → Do not reject H0

p-value < level of significance (α) → Reject H0

For a one-tailed test, the p-value is given by the tail area beyond the value of the sample statistic.
See Figure 10.3.

Figure 10.3 p-value for a right-tail test

For a two-tailed test, the tail area beyond the value of the sample statistic is multiplied by 2.
See Figure 10.4.

Figure 10.4 p-value for a two-tailed test

Example 10.5
The management of the sports club at a university claims that students use the gym about 10
times or more each year on average. To check this claim, a trainer at the gym took a sample of
36 students and found that the mean number of visits was 9.2. The population standard deviation
is known to be 2.4 visits. The trainer would like to test whether the mean number of times that
students use the gym is lower than what was claimed. Find the p-value for this test.

Solution:
n = 36, 𝑥̅ = 9.2, 𝜎 = 2.4

To find the p-value, we first find the z value for 𝑥̅ = 9.2.

\[ z = \frac{9.2 - 10}{2.4 / \sqrt{36}} = -2.00 \]

The area to the left of 𝑥̅ = 9.2 (or z < -2.00) is the p-value. From the standard normal table
(Appendix 2), we get 0.4772, which is the area between the mean and z = -2.

Hence, p-value = 0.5 - 0.4772 = 0.0228.

As the trainer is testing whether the mean is lower than 10, this would be a left-tail test.

Suppose α is 1%: the p-value 0.0228 > 0.01, so Ho will not be rejected.
Suppose α is 5%: the p-value 0.0228 < 0.05, so Ho will be rejected.

Although the conclusion for the test depends on the significance level set, we can say that the
smaller the p-value, the stronger the evidence against Ho.

We do not need to know the critical value when we use the p-value approach in hypothesis
testing.
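The p-value in Example 10.5 can be reproduced directly from the standard normal distribution. This is a minimal sketch using Python's standard library in place of the Appendix 2 look-up; the two significance levels are the ones considered in the example.

```python
import math
from statistics import NormalDist

# Figures from Example 10.5
mu0, sigma, n, x_bar = 10, 2.4, 36, 9.2

z = (x_bar - mu0) / (sigma / math.sqrt(n))
p_value = NormalDist().cdf(z)   # left-tailed test: area to the left of z

print(f"z = {z:.2f}, p-value = {p_value:.4f}")
decisions = {}
for alpha in (0.01, 0.05):
    decisions[alpha] = p_value < alpha          # True means reject H0
    print(f"alpha = {alpha}: reject H0 = {decisions[alpha]}")
```

For a two-tailed test, the tail area would be doubled before comparing it with α.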

6. Discussion Questions

1. Delmont, a manufacturer of chilli sauce, uses a particular machine to dispense 16 ounces
of its chilli sauce into containers. Delmont knows the amount of product in each
container follows a normal distribution with a mean of 16 ounces and a standard
deviation of 0.15 ounce. A sample of 50 containers filled last hour revealed the mean
amount per container was 16.017 ounces. The manufacturer wishes to test whether the
mean amount is different from 16 ounces.

(a) State the null hypothesis and the alternate hypothesis.


(b) At the 0.05 significance level, is there evidence to suggest that the mean amount
dispensed is different from 16 ounces? Show all steps with appropriate
workings.

2. The mean life of a battery used in a digital clock is 305 days. The lives of the batteries
follow the normal distribution. The battery was recently modified to last longer. A
sample of 49 of the modified batteries had a mean life of 311 days with a standard
deviation of 12 days. Did the modification increase the mean life of the battery? Show
all steps in a hypothesis test with appropriate workings. Use 5% significance level.

3. The mean time required to perform a certain job on a factory floor is 20.5 days. A
random sample of 16 employees is taught using a new method to perform the job. After
training, the mean time taken by these 16 employees is 17 days with a standard deviation
of 5 days. Assume that the time required to perform the task is normally distributed.

Do the results provide sufficient evidence to indicate that the time taken to perform the
job has reduced under the new method? Use α = 0.01.

4. A firm has decided that it will market a new product only if at least 35% of the people
like it. A random sample of 400 persons shows that only 128 or 32% said they liked it.

Using α = 5%, can the firm conclude that the proportion of people liking the product is
less than 35%? Show all steps clearly.

5. With internet banking, BBS Bank believes that customers have reduced their number
of visits to the Bank to perform banking transactions. Susan was asked to test whether
the average number of visits to the bank by personal consumers has reduced from last
year’s mean of 20 visits. A sample survey of 25 customers revealed a mean of 18 visits
per year with a sample standard deviation of 3.5 visits.

(a) State the null and alternative hypotheses.


(b) At the 5% significance level, is there evidence to conclude that the mean number
of visits to the Bank has fallen? Show all steps clearly.
(c) What is the probability of making a Type I error? Explain what this error means.

7. Supplementary Questions

1. Meiru was hired as a server at Tea Garden Family Restaurant and was told that she
could receive an average of more than $20 a day in tips. Assume that the population
of daily tips follows the normal distribution with a standard deviation of $3.24. Over
the first 35 days she was employed at the restaurant, the mean daily amount of her tips
was $24.85. Using the 35 days as the sample, can Meiru conclude that she is earning
an average of more than $20 in tips at the 1% significance level?

2. The management of Kea Furniture is considering a new method of assembling a large
size wardrobe. The present method requires 42.3 minutes, on the average, to assemble
the wardrobe. The mean assembly time for a random sample of 24 wardrobes, using the
new method, was 40.6 minutes, and the standard deviation of the sample was 2.7
minutes. Using the 0.05 level of significance, can we conclude that the assembly time
using the new method is faster?

3. SK-One claims that the mean potency strength of one of its weight reduction capsules
is at least 80. A random sample of 100 capsules was tested and produced a sample
mean potency strength of 78.5 with a standard deviation of 5.1.

(a) Do the data present sufficient evidence to reject SK-One’s claim at α = 0.05?
State your null and alternative hypotheses clearly.
(b) What is the chance of making a Type I error? What does a Type I error indicate
in this case?

4. A fresh orange juice vending machine is set to produce cups of fresh orange juice with
an average content of 190 ml. A random sample of 10 cups is drawn, and the average
was found to be 186 ml, with a standard deviation of 5 ml. The company would
like to determine whether this finding is consistent with the hypothesis that the
machinery is operating properly and producing drinks whose average content is 190ml.
Assume that the contents are approximately normally distributed.

(a) State the null hypothesis and the alternate hypothesis.


(b) At the 5% significance level can we conclude that the machine is operating
properly and producing cups of orange juice whose average content is 190 ml?

5. A recent article in the Property Edge magazine reported that the 30-year mortgage rate
is now less than 6%. A sample of 8 small banks in the country revealed the following
30-year rates (in percent)
4.8 5.3 6.5 4.8 6.1 5.8 6.2 5.6

(a) Compute the sample mean and sample standard deviation.


(b) At the 0.01 significance level, can we conclude that the 30-year mortgage rate
for small banks is less than 6%?

6. The Quikbite restaurant chain claims that the mean waiting time of customers for
service is 3 minutes with a population standard deviation of 1 minute. The quality
assurance department found in a sample of 50 customers that the mean waiting time
was 2.75 minutes.
(a) At the 0.05 significance level, can we conclude that the mean waiting time is
less than 3 minutes?
(b) What is the p-value? What is your decision regarding the null hypothesis based
on the p-value? Is this the same as the conclusion reached in part (a)?

7. Smucker’s manufactures a variety of fruit jams in glass bottles of 20 ounces. The
manufacturer is clear that the amount of fruit jam in each bottle follows a normal
distribution with a mean of 20 ounces and a standard deviation of 0.7 ounce. A sample
of 30 bottles of fruit jams filled in the last hour showed a mean amount of 19.53 ounces
per bottle. Using the information given above, does this evidence suggest that the mean
amount of fruit jam dispensed is less than 20 ounces? Use a significance level of 0.05.
(a) State the null and alternate hypotheses.
(b) State the decision rule under the conditions stated in (a).
(c) Compute the test statistic.
(d) State your decision regarding the null hypothesis.
(e) Conclude, in a single sentence, the result of the statistical test.
(f) State the p-value and your conclusion regarding the null hypothesis based on the
p-value. Is it the same as the conclusion you have reached in (d)?

8. A fast food outlet would like to prove that the mean sales of its burgers per week is
significantly lower than the overall average of $3,425 for all outlets combined. To do
so, this outlet sampled 40 weeks and found that the mean sales per week was $3,300.
The sample standard deviation was found to be $200.
(a) State the null and alternative hypotheses.
(b) At α = 0.02, can the outlet conclude that its mean weekly sales are significantly lower
than $3,425?

9. A manufacturer of electrical components claims that 99% of his products conform to
specifications. Eric believes that the proportion is lower than 99%. He took a sample
of 400 electrical components from a large production run and found that 392 conformed
to specifications.
(a) Set up the null and alternative hypotheses.
(b) At a 10% significance level, is there sufficient evidence to conclude that the
proportion of the products that conform to specifications is less than what the
manufacturer claims?

10. When working properly, a machine that is used to make chips for mobile phones does
not produce more than 4% defective chips. Whenever the machine produces more than
4% defective chips, it needs an adjustment. To check if the machine is working
properly, the quality control department takes a random sample of 200 chips and found
14 defective chips.

Test at the 5% significance level whether or not the machine needs an adjustment.

BUSINESS STATISTICS

SESSION 11

ANALYSIS OF CATEGORICAL DATA: CHI-SQUARE TEST OF INDEPENDENCE

At the end of the session, students should be able to:

1. organize categorical data into a contingency table.
2. set up null and alternative hypotheses for a chi-square test of independence.
3. compute expected frequencies and degrees of freedom from a given contingency table.
4. apply the chi-square distribution to perform a test of independence.

___________________________________________________________________________

1. Introduction

In Session 10, we focused on testing the value of a particular population parameter such as a
mean, µ, or a proportion, π. We now look at another testing procedure, one which deals with
testing for association between two categorical variables.

2. Correlation and Association

In Session 7 (Linear Regression and Correlation), we dealt with whether two quantitative
variables are related to each other or are correlated. It was possible to measure the strength of
correlation between the 2 variables and also to determine to what extent a change in a variable
explains the change in the other variable.

Often, the variables that we deal with are not numeric variables. For example, we may want to
know whether gender and preferred product brand are related. In this case, we will ask whether
there is an association between Gender and Brand. Or, we say “Are gender and brand
independent?” We will need frequency data in order to answer such questions. Thereafter, we
conduct a chi-square test of independence/association.

3. Characteristics of the chi-square distribution

o It is positively skewed.
o It is non-negative.
o It is based on degrees of freedom.
o Whenever the degrees of freedom change, a new distribution is created. (see
Figure 11.1).

Figure 11.1 Chi-square distribution curves

The symbol χ is a Greek letter and is pronounced as “kye”. The values of a chi-square
distribution are denoted by the symbol χ², just as the values of the standard normal distribution
and the t distribution are denoted by z and t respectively.

4. Contingency Tables

A contingency table is also known as a cross-tabulation. The data in the table represent
frequencies (counts) where observations are organised in cross-tabulated categories.

Example 11.1
Students in non-business courses at a polytechnic were required to choose one of three elective
subjects. The following table shows the subjects chosen and the gender of a sample of 300
students.

                          Elective Subject
Gender     Stock Investment   Market Research   Creative Arts   Total
Male              93                 70               12         175
Female            87                 32                6         125
Total            180                102               18         300
Table 11.1 Contingency Table

The cell frequencies are known as observed frequencies and show how the data are spread
across the different combinations of Gender and Elective Subjects. The size of the table is
determined by the number of row categories and number of column categories. Table 11.1 has
2 rows and 3 columns and is known as a 2 X 3 table.

5. Basic Idea of chi-square test

The basic idea of chi-square test is to compare the observed frequencies which we have obtained
from sample data with a set of expected frequencies, conditional on the null hypothesis of no
association between the variables, that is, the variables are independent.

If the observed (actual) and expected (theoretical) frequencies are nearly alike (that is, we
observe only small differences), we can conclude that the two variables are not related. If these
frequencies differ substantially, then there is stronger evidence that the two variables are
related. However, to be more precise, we require a method of calculating this degree of
difference before we are able to draw a conclusion. The χ² statistic provides us with the method
to do this.

The steps to conduct a chi-square test are as follows:

1. State the null and alternative hypotheses.
2. Select a level of significance (α).
3. Calculate the value of the test statistic (i.e. χ²).
4. Determine the χ² critical value and form the decision rule.
5. Arrive at conclusion.

6. Performing the chi-square test of independence

To conduct a chi-square test based on the data in Table 11.1, we proceed as follows:

Solution:
Step 1: State the null and alternative hypotheses

In a chi-square test of independence, the null hypothesis must be that the two attributes are
independent or not related. Consequently, the alternative hypothesis is that the attributes are
related. Referring to the contingency table (Table 11.1), we have

Ho: There is no relationship between Gender and Elective Subject chosen.


H1: There is a relationship between Gender and Elective Subject chosen.

Step 2: Select the distribution to use and choose a significance level (α) to conduct the test.

We use the chi-square distribution to make a test of independence for a contingency table. We
shall use a 5% significance level.

Step 3: Calculate the value of the test statistic (i.e. χ²)

Before the calculated value of chi-square can be found, we need to determine the expected
frequency (E) for each cell assuming the two variables are not related. The expected frequency
for each cell is calculated using:

\[ E = \frac{\text{Row Total} \times \text{Column Total}}{n} \]

where n refers to the total sample size

                                 Elective Subject
            Stock Investment      Market Research      Creative Arts
Gender        O        E            O        E           O        E        Total
Male         93    105.0 (1)       70     59.5 (2)      12     10.5 (3)     175
Female       87     75.0 (4)       32     42.5 (5)       6      7.5 (6)     125
Total       180    180            102    102            18     18           300
Table 11.2 Observed and Expected Frequencies (cell numbers in brackets)

Using this formula, the expected frequencies (E) of the above six cells are calculated as follows:

Cell 1: (175 × 180) / 300 = 105.0        Cell 4: (125 × 180) / 300 = 75.0
Cell 2: (175 × 102) / 300 = 59.5         Cell 5: (125 × 102) / 300 = 42.5
Cell 3: (175 × 18) / 300 = 10.5          Cell 6: (125 × 18) / 300 = 7.5
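The same six expected frequencies can be generated from the observed table in one pass. The sketch below is illustrative; the function name `expected_frequencies` is an assumption, and the observed counts are those of Table 11.1 without the row and column totals.

```python
def expected_frequencies(observed):
    """Expected count for each cell: row total x column total / n."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]

# Observed frequencies from Table 11.1 (rows: Male, Female)
observed = [[93, 70, 12],
            [87, 32, 6]]

expected = expected_frequencies(observed)
for row in expected:
    print([round(e, 1) for e in row])
# prints: [105.0, 59.5, 10.5]
#         [75.0, 42.5, 7.5]
```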

The chi-square statistic is computed as follows:

\[ \chi^2 = \sum \left[ \frac{(O - E)^2}{E} \right] \]

\[ = \frac{(93 - 105.0)^2}{105.0} + \frac{(70 - 59.5)^2}{59.5} + \frac{(12 - 10.5)^2}{10.5} + \frac{(87 - 75.0)^2}{75.0} + \frac{(32 - 42.5)^2}{42.5} + \frac{(6 - 7.5)^2}{7.5} \]

= 1.371 + 1.853 + 0.214 + 1.920 + 2.594 + 0.300 = 8.252
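The sum above is easy to verify cell by cell. This minimal sketch pairs each observed frequency with its expected frequency from Table 11.2; the variable names are illustrative. The unrounded total comes out marginally above the hand-worked figure, which adds terms already rounded to three decimal places.

```python
# Observed and expected frequencies for cells 1 to 6 of Table 11.2
observed = [93, 70, 12, 87, 32, 6]
expected = [105.0, 59.5, 10.5, 75.0, 42.5, 7.5]

# Contribution of each cell: (O - E)^2 / E
contributions = [(o - e) ** 2 / e for o, e in zip(observed, expected)]
chi_square = sum(contributions)

print([round(c, 3) for c in contributions])
print(round(chi_square, 2))
```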

Step 4: Determine the χ² critical value and form the decision rule.

The chi-square test is always right-tailed, hence the rejection region falls on the right tail of the
chi-square distribution. The contingency table contains two rows (Male and Female) and three
columns (Stock investment, Market research and Creative arts). Note that we do not count the
row and column totals.

Degrees of freedom (df) = (Rows -1) (Columns -1)


= (2-1) (3-1) = 2

From Appendix 4 (Chi-square distribution table), for df = 2 and α = 0.05, we have the critical
χ² = 5.991. (See Figure 11.2)

Figure 11.2 Chi-square distribution table

The rejection and non-rejection regions are shown in Figure 11.3.

Figure 11.3 Rejection and Non-Rejection Regions

Decision rule: Reject Ho if χ² statistic > 5.991.

Step 5: Arrive at conclusion


The value of the test statistic χ² = 8.252 is greater than the critical χ² of 5.991 and it falls in
the rejection region. Hence, we reject the null hypothesis and conclude that there is a
relationship between Gender and Elective Subject chosen at α = 0.05.
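Steps 4 and 5 can also be checked without the table in this particular case. For df = 2 only, the upper-tail area of the chi-square distribution beyond a value x equals exp(-x/2), so the critical value is -2 ln(α). The sketch below relies on that special case; it is not a general substitute for Appendix 4, and the variable names are illustrative.

```python
import math

rows, cols = 2, 3                 # Table 11.1, excluding the totals
df = (rows - 1) * (cols - 1)
assert df == 2                    # the closed form below holds only for df = 2

alpha = 0.05
chi_square = 8.252                # test statistic from Step 3

critical_value = -2 * math.log(alpha)      # matches 5.991 in Appendix 4
p_value = math.exp(-chi_square / 2)        # upper-tail area beyond the test statistic
reject_h0 = chi_square > critical_value    # right-tailed decision rule

print(f"critical value = {critical_value:.3f}, p-value = {p_value:.4f}, reject H0: {reject_h0}")
```

The p-value is below α = 0.05, which agrees with the decision reached using the critical value.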

Example 11.2
A social scientist sampled 140 people to study the relationship between income level and lottery
playing.
                       Income Level
Lottery          Low    Middle    High    Total
Play              46      28       21       95
Did not Play      14      12       19       45
Total             60      40       40      140
Is it reasonable to conclude that playing lottery is related to income level? Use 0.01 significance
level.

Solution:
Ho: There is no relationship between income level and lottery playing
H1: There is a relationship between income level and lottery playing
α = 0.01

Expected Frequencies (E)
                       Income Level
Lottery          Low      Middle     High     Total
Play            40.71     27.14     27.14       95
Did not Play    19.29     12.86     12.86       45
Total           60        40        40         140

\[ \chi^2 = \sum \left[ \frac{(O - E)^2}{E} \right] \]

\[ = \frac{(46 - 40.71)^2}{40.71} + \frac{(28 - 27.14)^2}{27.14} + \frac{(21 - 27.14)^2}{27.14} + \frac{(14 - 19.29)^2}{19.29} + \frac{(12 - 12.86)^2}{12.86} + \frac{(19 - 12.86)^2}{12.86} \]

= 6.544

Degrees of freedom (df) = (Rows -1) (Columns -1)
= (2-1) (3-1) = 2

Critical χ² (α = 0.01, df = 2) = 9.210

Decision rule: Reject Ho if χ² statistic > 9.210.

Since χ² statistic < 9.210, we do not reject Ho. There is insufficient evidence to conclude that
there is a relationship between income level and lottery playing at α = 0.01.
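The whole procedure of Example 11.2 can be collected into one small function. This is a sketch under stated assumptions: the function name is hypothetical, and the critical value is passed in as read from Appendix 4 rather than computed.

```python
def chi_square_independence(observed, critical_value):
    """Return (chi-square statistic, degrees of freedom, reject H0?) for a contingency table."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    chi_square = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / n   # expected frequency
            chi_square += (o - e) ** 2 / e
    df = (len(observed) - 1) * (len(observed[0]) - 1)
    return chi_square, df, chi_square > critical_value

# Observed frequencies from Example 11.2 (rows: Play, Did not Play)
lottery = [[46, 28, 21],
           [14, 12, 19]]
chi_square, df, reject_h0 = chi_square_independence(lottery, critical_value=9.210)
print(f"chi-square = {chi_square:.3f}, df = {df}, reject H0: {reject_h0}")
```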

7. Precautions of using the Chi-Square Test

• The observed and expected frequencies should be “sufficiently large” for the results to be
reliable.

• When χ² = 0, it is too good to be true. We should suspect that
(a) the sample is not random but carefully selected.
(b) the sample consists of fictitious data and is not the result of proper statistical
investigation.

8. Discussion questions

1. Here is a cross-tabulation from a survey of 500 women indicating average weekly
expenditures on cosmetics and employment status.

Spending on                 Employment Status
Cosmetics          Full-time    Part-time    Not working
Less than $10         30           20            60
$10 – $20             55           60            65
Over $20              55           80            75

Test at 0.01 level of significance whether there is any relationship between the
employment status and average weekly spending on cosmetics among women.

2. Jo Sport has a new design of spike shoes and wishes to determine whether exposure to its
advertisement differs across the three media used. The results of the study are as
follows:

Media Used
Seen Ad? Television Magazine Internet Total
Yes 70 20 35 125
No 30 30 15 75
Total 100 50 50 200

(a) At the 5% significance level, is there evidence of a relationship between media
type and whether a person has seen the advertisement?
(b) Would your conclusion change if you are testing at 10% significance level?
Explain.

3. SQA Tours has developed 2 different itineraries for tourist groups visiting Singapore for
3 days. The following table shows the itinerary chosen and the country of origin of these
tourist groups.

Country Itinerary 1 Itinerary 2 Total


Taiwan 65 13 78
Japan 35 17 52
Total 100 30 130

(a) At α = 0.05, is there sufficient evidence to conclude that the preferred itinerary
depends on country of origin? Do a chi-square test and show all steps clearly.

(b) Explain why the chi-square test is a right-tailed test.

9. Supplementary questions

1. A sample of 150 students applying for a place in medical school were tested for
personality type. The following table gives the results of the survey:

Type X Type Y Total


Male 78 42 120
Female 19 11 30
Total 97 53 150

Test at the 5% significance level if gender and personality type are related for all
students.

2. The table below shows a contingency table for a sample of 1104 randomly selected
adults from three types of environment (Urban, Suburban and Rural), classified into
two groups by level of exercise.

                  Level of exercise
Environment      High      Low      Total
Urban             221      256        477
Suburban          230      118        348
Rural             159      120        279
Total             610      494       1104

Test the hypothesis that there is no relationship between level of exercise and type of
environment and draw conclusions. (Use α = 1%.)

3. In an online road safety research, many drivers admitted to unsafe road behaviour. The
age group and most frequent type of unsafe road behaviour are tabulated below:

                            Type of unsafe road behavior
Age Group         Reckless    Using mobile    Indiscriminate    Accelerate at    Total
                  driving     device          lane changing     amber light
18 to 35 years       3            22                8                 7            40
Over 35 years        2            11               18                 9            40
Total                5            33               26                16            80

Is there a relationship between age group and the most frequent type of unsafe road
behavior at the 1% significance level? Do a chi-square test and show all steps clearly.
(Express all computations to 1 decimal place)

4. Winn Electronics manufactures component parts for tablets and mobile phones. The
company has two machines that are used to make a component called CS1. From time to
time the quality controller at the company takes a sample of CS1 and checks them. A
recent check of 200 units of CS1 showed the following results:

Good Defective Total


Machine Alpha 109 11 120
Machine Beta 66 14 80
Total 175 25 200

Does the sample provide sufficient evidence to conclude that the two variables, the
machine type and the component quality (good or defective) are dependent? Do a chi-
square test using 1% significance level.

BUSINESS STATISTICS

SESSION 12

COMPARING MEANS: ANALYSIS OF VARIANCE (ANOVA)

At the end of the session, students should be able to:

1. understand the general approach of analysis of variance technique.


2. describe the type of application that analysis of variance is used for.
3. carry out the one-way ANOVA procedure to test for difference among the means of two
or more groups.
_________________________________________________________________________

1. Introduction

The Analysis of Variance or ANOVA is a procedure used for comparing two or more groups
to establish if the means are equal. We will examine two or more independent samples to
determine if the population means could be equal with respect to only one factor. Hence, this
is known as a one-way ANOVA procedure.

Figure 12.1

The first chart in Figure 12.1 shows that all three population means are equal. In the second
chart, not all means are equal.

2. Basic idea behind ANOVA

The basic idea behind ANOVA is to analyse the total variation in the data, both between and
within the k groups. From this analysis, we are able to draw conclusions about possible
differences in group means.

In ANOVA, we subdivide the total variation into that which is attributable to differences
between the k groups and that which is due to chance or random variation within the k groups.

“Within group” variation is considered experimental error, “between group” variation is
attributable to treatment effects.

Figure 12.2 Analysis of variation in the data

3. Assumptions

The F-distribution will be used to test whether the two or more population means are equal. A
one-way ANOVA is carried out under the following assumptions:
- Each sample is drawn from a normal or approximately normal population.
- The populations have equal variances (or standard deviations).
- The samples are randomly selected and are independent.

4. The F-distribution

Like the t and chi-square distributions, the shape of a particular F distribution curve depends
on the number of degrees of freedom. The characteristics of the F Distribution are:
- The F value is always non-negative.
- The F distribution is a family of continuous distributions. Its shape is skewed to the
right, but the skewness decreases as the number of degrees of freedom increases.
- It has two numbers for the degrees of freedom: the degrees of freedom for the numerator
and the degrees of freedom for the denominator.

Figure 12.3 The F- Distribution

5. Performing a One-way Analysis of Variance

We will now look into the procedure to carry out a one-way ANOVA to test whether the means
of two or more populations are equal.
The steps to carry out an ANOVA test are as follows:
1. State the null and alternative hypotheses.
2. Select a level of significance (α).
3. Calculate the value of the F test statistic.
4. Determine the F critical value and formulate the decision rule.
5. Arrive at conclusion.

Example 12.1
A large company used three different training methods in orientating new marketing trainees
to their jobs. Upon completion of the training period, the training director chose 15 trainees,
who were randomly assigned to the three training methods. To compare the effectiveness of
these training methods, the researcher examined the quarterly sales (units) made by the 15
trainees. The results are shown in Table 12.1 below:

Method 1 Method 2 Method 3


92 58 61
70 89 55
65 68 99
72 77 64
81 94 78
Table 12.1

At the 0.05 level of significance, test whether the mean sales for each of the three training
methods are the same. Assume that all the assumptions required to apply the one-way ANOVA
procedure hold true.

Solution:
To conduct the ANOVA test, we proceed as follows:

Step 1: State the null and alternative hypotheses.

H0 : Means for all populations are equal


H1 : The means are not all equal.

Symbolically, it can be expressed as


H0 : µ1 = µ2 = µ3
H1 : At least one population mean is different.

The ANOVA procedure is always one-tailed (right-tailed test).

Step 2: Select the distribution to use and choose a significance level (α) to conduct the test.

We use the F distribution to carry out the test. We shall use a 5% significance level.

Step 3: Calculate the value of the test statistic (i.e. F statistic)

The F statistic is given by the following formula :

\[ F = \frac{SST/(k-1)}{SSE/(n-k)} = \frac{MST}{MSE} \]
where SST is the Treatment variation
SSE is the variation due to the Random or Error component
k = number of groups
n= total sample size

The numerator is the Mean Square Treatment which is the mean variation between different
treatment groups.
The denominator is the Mean Square Error which is the mean variation within each treatment
group.

Step 3a: Computing Total Variation (SS Total)

Method 1 Method 2 Method 3


92 58 61
70 89 55
65 68 99
72 77 64
81 94 78

First, we compute the total variation (SS Total). Total variation refers to the sum of the squared
differences between each observation and the overall (grand) mean.
\[ SS\,Total = \sum (X - \bar{\bar{X}})^2 \quad \text{where} \quad \bar{\bar{X}} = \frac{\sum X}{n} \]

\[ \bar{\bar{X}} = \frac{92+70+65+72+81+58+89+68+77+94+61+55+99+64+78}{15} = 74.87 \]

\[
\begin{aligned}
SS\,Total &= \sum (X - \bar{\bar{X}})^2 \\
&= (92 - 74.87)^2 + (70 - 74.87)^2 + (65 - 74.87)^2 + (72 - 74.87)^2 + (81 - 74.87)^2 \\
&\quad + (58 - 74.87)^2 + (89 - 74.87)^2 + (68 - 74.87)^2 + (77 - 74.87)^2 + (94 - 74.87)^2 \\
&\quad + (61 - 74.87)^2 + (55 - 74.87)^2 + (99 - 74.87)^2 + (64 - 74.87)^2 + (78 - 74.87)^2 \\
&= 2659.73
\end{aligned}
\]
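The grand mean and SS Total can be checked with a few lines of code. The following is a minimal sketch in Python (not part of the module's Excel workflow); the variable names are illustrative assumptions.

```python
# Quarterly sales (units) from Table 12.1, keyed by training method.
sales = {
    "Method 1": [92, 70, 65, 72, 81],
    "Method 2": [58, 89, 68, 77, 94],
    "Method 3": [61, 55, 99, 64, 78],
}

# Pool all 15 observations to obtain the overall (grand) mean.
all_values = [x for group in sales.values() for x in group]
grand_mean = sum(all_values) / len(all_values)

# SS Total: squared deviation of every observation from the grand mean.
ss_total = sum((x - grand_mean) ** 2 for x in all_values)

print(round(grand_mean, 2))  # 74.87
print(round(ss_total, 2))    # 2659.73
```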

Step 3b: Computing the random or error component (SSE)

Next, we compute the random variation. This is the sum of the squared differences between
each observation and its treatment mean.

\[ SSE = \sum (X - \bar{X}_j)^2 \]

where \(\bar{X}_j\) = column mean, that is, the sample mean of each group

Method 1 Method 2 Method 3


92 58 61
70 89 55
65 68 99
72 77 64
81 94 78

Sample mean for method 1: \[ \bar{x}_1 = \frac{92+70+65+72+81}{5} = 76 \]
Sample mean for method 2: \[ \bar{x}_2 = \frac{58+89+68+77+94}{5} = 77.2 \]
Sample mean for method 3: \[ \bar{x}_3 = \frac{61+55+99+64+78}{5} = 71.4 \]

\[
\begin{aligned}
SSE &= \sum (X - \bar{X}_j)^2 \\
&= (92 - 76)^2 + (70 - 76)^2 + (65 - 76)^2 + (72 - 76)^2 + (81 - 76)^2 \\
&\quad + (58 - 77.2)^2 + (89 - 77.2)^2 + (68 - 77.2)^2 + (77 - 77.2)^2 + (94 - 77.2)^2 \\
&\quad + (61 - 71.4)^2 + (55 - 71.4)^2 + (99 - 71.4)^2 + (64 - 71.4)^2 + (78 - 71.4)^2 \\
&= 2566
\end{aligned}
\]

Step 3c: Computing SST


Finally, we determine SST, the sum of the squares due to the treatments, by subtraction.

SSTotal = SST + SSE


Hence, SST = SSTotal – SSE
= 2659.73 – 2566
= 93.73
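As a cross-check on the subtraction, SST can also be computed directly as the sum, over groups, of the group size times the squared difference between the group mean and the grand mean. The sketch below (Python, with illustrative variable names that are assumptions of this example) computes SSE, then SST both ways.

```python
# Quarterly sales (units) from Table 12.1, one list per training method.
groups = [
    [92, 70, 65, 72, 81],  # Method 1
    [58, 89, 68, 77, 94],  # Method 2
    [61, 55, 99, 64, 78],  # Method 3
]

def mean(values):
    """Arithmetic mean of a list of numbers."""
    return sum(values) / len(values)

all_values = [x for g in groups for x in g]
grand_mean = mean(all_values)

# SSE: deviations of each observation from its own group (column) mean.
sse = sum((x - mean(g)) ** 2 for g in groups for x in g)

# SS Total: deviations of each observation from the grand mean.
ss_total = sum((x - grand_mean) ** 2 for x in all_values)

# SST by subtraction, as in Step 3c.
sst_by_subtraction = ss_total - sse

# SST directly: group size times squared gap between group mean and grand mean.
sst_direct = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)

print(round(sse, 2))                 # 2566.0
print(round(sst_by_subtraction, 2))  # 93.73
print(round(sst_direct, 2))          # 93.73
```

Both routes give the same SST, which confirms the identity SS Total = SST + SSE for this data set.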

Step 3d: Compute the F statistic


\[ F = \frac{SST/(k-1)}{SSE/(n-k)} = \frac{93.73/(3-1)}{2566/(15-3)} = 0.219 \]

Step 4: Determine the F critical value and formulate the decision rule.

The significance level is 0.05, which means the area in the right tail of the F distribution is 0.05.
Degrees of freedom (numerator) = k - 1 = 3 - 1 = 2
Degrees of freedom (denominator) = n - k = 15 - 3 = 12

Referring to Appendix 5 (F distribution), the critical value of F is 3.89 (see Figure 12.4)

Figure 12.4 F-Distribution Table

Decision rule: Reject Ho if F statistic > 3.89

Step 5: Arrive at conclusion.

The value of the F test statistic (0.219) is less than the critical F value of 3.89, so it falls
outside the rejection region (see Figure 12.5). Hence, we do not reject the null hypothesis and
conclude that there is insufficient evidence to show a difference in the mean sales among the
three methods at α = 0.05.
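The five steps of Example 12.1 can be collected into one small routine. This is a sketch under stated assumptions: the function name `one_way_anova` and the dictionary keys are hypothetical, and the critical value 3.89 is simply copied from Appendix 5 as in Step 4 rather than computed.

```python
def one_way_anova(groups):
    """Return the one-way ANOVA quantities for a list of samples."""
    k = len(groups)                      # number of groups
    n = sum(len(g) for g in groups)      # total sample size
    grand_mean = sum(sum(g) for g in groups) / n
    group_means = [sum(g) / len(g) for g in groups]

    # Between-group (treatment) and within-group (error) sums of squares.
    sst = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means))
    sse = sum((x - m) ** 2 for g, m in zip(groups, group_means) for x in g)

    mst = sst / (k - 1)
    mse = sse / (n - k)
    return {"SST": sst, "SSE": sse, "df_num": k - 1, "df_den": n - k,
            "MST": mst, "MSE": mse, "F": mst / mse}

result = one_way_anova([
    [92, 70, 65, 72, 81],  # Method 1
    [58, 89, 68, 77, 94],  # Method 2
    [61, 55, 99, 64, 78],  # Method 3
])

F_CRITICAL = 3.89  # F(0.05; 2, 12), read from the F table as in Step 4
decision = "reject H0" if result["F"] > F_CRITICAL else "do not reject H0"
print(round(result["F"], 3), decision)  # 0.219 do not reject H0
```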

Figure 12.5 Rejection and Non Rejection Regions

6. Performing one-way ANOVA using Excel

Input the data as follows (Figure 12.6):

Figure 12.6 Data Input

Next, click Data -> Data Analysis -> Anova: Single Factor. (In older versions of Excel, Data Analysis is found under the Tools menu.)

Figure 12.7 Analysis Tools

Select the input range.

Figure 12.8 Input Range

The following Excel output (Figure 12.9) will be generated:

Figure 12.9 Anova Excel Output

Statistical software packages (e.g. SPSS, Minitab) also present the ANOVA table in a
similar format (see Table 12.2).

Source of Variation   Sum of Squares   Degrees of Freedom   Mean Square           F
Treatments            SST              k - 1                MST = SST/(k - 1)     MST/MSE
Error                 SSE              n - k                MSE = SSE/(n - k)
Total                 SS Total         n - 1


Table 12.2 ANOVA Table

Example 12.2
Last year was a boom year for the stock market with stock prices for many industries hitting
historical highs. A study was carried out to find out whether the mean rates of return (in
percent) for stocks of companies in four industries are the same. The following results were
generated using Excel.

Groups Count Sum Average Variance


Property & Construction 7 51.6 7.3714 0.5023
Banking & Finance 7 74.7 10.6714 3.8557
Manufacturing 9 75.5 8.3888 0.7186
Technology 6 48.9 8.1500 0.7670
ANOVA
Source of Variation SS df MS F
Between Groups 42.1592 A 14.0530 ?
Within Groups 35.7324 25 B
Total 77.8916 28

(a) Compute the values marked “A” and “B”.


(b) Formulate the null and alternative hypotheses.
(c) At the 1% significance level, test the null hypothesis that the mean rates of return for
companies in the four industries are equal. Show all steps of the test clearly.

Solution (a):
A: df (numerator) = k – 1 = 4 -1 = 3
B: MSE = SSE/(n-k) = 35.7324/25 = 1.4293

Solution (b):
H0 : µproperty = µbanking = µmanufacturing = µtechnology
H1: The means are not all equal

Solution (c):
\[ F = \frac{SST/(k-1)}{SSE/(n-k)} = \frac{42.1592/(4-1)}{35.7324/(29-4)} = 9.832 \]

Critical F0.01,3,25 = 4.68 (refer to Appendix 5)

Decision rule: Reject H0 if F statistic > 4.68

Since the F statistic (9.832) > 4.68, reject H0. There is sufficient evidence of a difference
in the mean rates of return for companies in the four industries at α = 0.01.
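When only the Excel summary block (Count, Sum, Variance) is available, the ANOVA sums of squares can still be recovered: the within-groups SS is the sum of (count - 1) times each group variance, and the between-groups SS uses the group means and the grand mean. The sketch below illustrates this for Example 12.2; the variable names are assumptions, and small differences from the printed table arise because the variances are rounded to four decimal places.

```python
# (count, sum, variance) for each industry, copied from the Excel summary.
summary = {
    "Property & Construction": (7, 51.6, 0.5023),
    "Banking & Finance":       (7, 74.7, 3.8557),
    "Manufacturing":           (9, 75.5, 0.7186),
    "Technology":              (6, 48.9, 0.7670),
}

k = len(summary)
n = sum(count for count, _, _ in summary.values())
grand_mean = sum(total for _, total, _ in summary.values()) / n

# Between-groups SS: group size times squared gap between group mean and grand mean.
sst = sum(c * (t / c - grand_mean) ** 2 for c, t, _ in summary.values())

# Within-groups SS: (group size - 1) times the sample variance of each group.
sse = sum((c - 1) * v for c, _, v in summary.values())

mst = sst / (k - 1)
mse = sse / (n - k)
f_stat = mst / mse

print(k - 1, n - k)                                    # 3 25
print(round(sst, 2), round(sse, 2), round(f_stat, 2))  # 42.16 35.73 9.83
```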

7. Discussion questions

1. A physician who specializes in weight control has three different diets he recommends.
He randomly selected 15 patients and then assigned 5 to each type of diet. After six
months, the following weight losses (in kilograms) were noted.

Diet A Diet B Diet C


5 6 7
7 7 8
4 7 9
5 5 8
4 6 9

At the 0.05 significance level, can he conclude that there is a difference in the mean
amount of weight loss among the three diet types? Show all steps clearly.

2. A factory owner would like to determine whether there is a difference between the mean
numbers of breakdowns in three factories at different locations. He recorded the number
of breakdowns in each factory for a sample of 8 days.

Tuas Tampines Sembawang


2 2 5
5 1 3
5 1 5
3 1 2
2 1 1
4 3 3
2 5 5
4 4 5

The owner performed an ANOVA analysis and produced the following output:
Groups Count Sum Average Variance
Tuas 8 27 3.375 1.696429
Tampines 8 18 2.25 2.5
Sembawang 8 29 3.625 2.5533571

Source of Variation SS df MS F
Between Groups 8.58 2
Within Groups 47.25 21
Total 55.83 23

At the 0.01 significance level, can the factory owner conclude that there is a difference
in the mean number of breakdowns?

3. Amazing Tours wants to determine whether the average expenditure of the tourists
from China, Japan, India and Korea are different. A sample of 7 tourists was taken
from each of the 4 countries.

(a) State the Null and Alternative hypotheses.


(b) What is the critical F value at the 5% significance level?
(c) At the 5% significance level, can you conclude whether there is a difference in
the average expenditure of the tourists from China, Japan, India and Korea?
(d) Suppose Amazing Tours wants to determine whether preferred itinerary is
related to the tourists’ country of origin. Can the ANOVA test be used?
Explain.

4. Roger wishes to find out whether there were differences in the mean satisfaction ratings
(1=lowest and 10=highest) for the 3 flat types that he renovated. Results from a sample
of clients are shown below:

Groups Count Sum Average Variance


3-room 6 43 7.16666 1.76666
4-room 7 58 8.28571 1.90476
5-room 7 53 7.57142 1.61904

Source of Variation SS df MS F
Between Groups A 2 2.1119 C
Within Groups 29.9761 17 B
Total 34.2000 19
(a) State the null and alternative hypotheses.
(b) Fill in the missing values labelled “A”, “B” and “C”
(c) Using α = 0.05, test whether there were significant differences in the mean
satisfaction ratings amongst the clients of different flat types. Show all steps
clearly.

8. Supplementary questions

1. The following ANOVA table, based on information obtained for four samples selected
from four independent populations that are normally distributed with equal variances,
has a few missing values.

Source of Variation SS df MS F
Between Groups
4.07
Within Groups 15 9.2154
Total 18

(a) Find the missing values and complete the ANOVA table.
(b) Using α = 0.05, what is your conclusion for the test? State clearly your null
and alternative hypotheses.

2. Given the following table of information, use a significance level of 0.01 to test
whether the treatment means are equal.

Treatment 1 Treatment 2 Treatment 3


3 3 5
6 1 10
9 5 3
2 2 2
7 7 9

(a) State the null and alternate hypotheses.


(b) State the decision rule.
(c) Using the information above, calculate the values of SS Total, SSE, and SST.
(d) Construct an ANOVA table.
(e) What is your decision with regards to the null hypothesis?

3. AA Cooker is one of the most popular brands of rice cooker and the cookers are being
sold at various outlets of supermarkets. To further improve its marketing strategies, the
cookers are test-marketed by having displays placed at different areas of the
supermarket. The table below shows the number of cookers successfully sold at five
different locations in the supermarket during three randomly selected days.

Daily Sales (No. of units)


Day 1 Day 2 Day 3
Near other rice cookers 14 15 13
Near toiletries 8 6 7
Near frozen foods 4 5 6
Near the wine section 7 5 5
Near the cashier 11 8 7

Using a significance level of 0.01, state whether there are any differences in the mean
number of cookers sold at the five different locations.
(a) State the null and alternate hypotheses.

(b) State the decision rule.
(c) Given the following partial ANOVA table, draw your conclusions.

Source of Variation   SS          df   MS   F
Treatments            155.6       4
Error                 17.33333    10
Total                 172.9333    14

4. A company manager is evaluating 3 training programmes (K, L, M) for training its
marketing personnel. 12 new staff were selected and 4 were assigned to each training
programme. The six-month sales figures (number of units) were recorded for these
staff.
Programme
K L M
65 62 68
54 58 65
57 64 72
60 74 63

You helped to analyse the data using Excel and obtained the following results.

Groups Count Sum Average Variance


K 4 236 59 22
L 4 258 64.5 46.3333
M 4 268 67 15.3333

ANOVA
Source of Variation SS df MS F
Between Groups 2 67
Within Groups 251
Total

(a) Which training programme shows the highest average sales figure amongst the
staff?
(b) Formulate the null and alternative hypotheses.
(c) Do the data provide sufficient evidence to indicate a difference in mean sales
for staff trained under the three programmes? (Assume α = 5%.)

5. A one-way ANOVA analysis was conducted to explore whether there were significant
differences in the mean credit card spending (in $) for 3 types of credit cards that ABC
Bank issues. Partial results of the analysis are shown in the tables below:

Groups Count Sum Average Variance


Ladies 6 4276 712.6666667 220412.6667
Revolution 8 5013 626.625 73656.55357
Platinum 6 4945 824.1666667 310212.1667

ANOVA
Source of Variation SS df MS F
Between Groups 133800.1583
Within Groups 3168720.042
Total 3302520.2

(a) How many groups (k) were surveyed and what is the total sample size(n)?
(b) State the null and alternative hypotheses.
(c) Compute the F statistic.
(d) At the 1% significance level, is there a difference in the mean credit card
spending for the three types of cards? Show all steps clearly.

MOCK EXAM PRACTICE PAPER

Section A

Question 1

(a) A marketer wants to obtain feedback for the design of a new product packaging. Five
designs are being considered and respondents were asked to rank their preferences for
these designs (5= Most Preferred and 1= Least Preferred). The sample of respondents
was obtained by interviewing every 20th shopper who walks into a particular store.

(i) Are the data values obtained discrete or continuous?

(ii) Identify the level of measurement for the response value.

(iii) What sampling method is being used?

(b) An interior designer has secured several jobs from a newly completed Build-to-Order
(BTO) project. His clients have varying budgets. The following table shows the
frequency distribution of the renovation budgets from a sample of 20 clients.
Budget ($000) Frequency
5 up to 20 3
20 up to 35 7
35 up to 50 5
50 up to 65 4
65 up to 80 1

(i) Compute the mean and standard deviation. (Express your answers up to 2
decimal places)

(ii) Find the percentage of clients with budgets below $50,000.

(iii) Construct a histogram. Label your diagram clearly.

(iv) We should always use the mean as a measure of central tendency. Discuss this
statement.

(c) The number of traffic violations due to speeding is on the rise. The table below shows
data collected from 500 drivers on whether they had speeding violations over the past
three months, together with their car type.

                              Speeding violation
Car Type                      Yes      No       Total
Small cars (< 2500 cc)        50       150      200
Big cars (>= 2500 cc)         75       225      300
Total                         125      375      500

(i) Find the probability that a randomly selected driver has a speeding violation.

(ii) Given that a randomly selected driver drives a small car, what is the probability
that the person does not have a speeding violation?

(iii) Find the probability that a randomly selected driver has a speeding violation and
drives a big car.

(iv) Find the probability that a randomly selected driver does not have a speeding
violation or does not drive a big car.

Question 2

(a) What is the purpose of drawing a scatter diagram when using correlation and regression
analysis?

(b) The supervisor at Hoe’s factory has collected data on a random sample of 8 workers
about the time workers spent on attending training programmes (in hours) and the
number of units manufactured per day.

Worker   Number of hours spent on training (x)   Number of units manufactured (y)
1        34                                      9
2        25                                      8
3        10                                      4
4        30                                      9
5        20                                      6
6        12                                      5
7        15                                      8
8        22                                      7

Given: ΣX = 168, ΣY = 56, ΣX² = 4034, ΣXY = 1270

(i) Plot the data on a scatter diagram.

(ii) Find the estimating equation for the regression line.

(iii) Interpret the gradient ‘b’.

(iv) Estimate the number of units manufactured by a worker who spends 42 hours
on training programmes. Comment on the likely accuracy of the estimate.

(c) A car dealer wanted to investigate how the price of one of its car models decreases with
age. The research department took a sample of eight cars and collected the following
information on the age and selling price of these cars. The data are shown below:
Age of car (years) 8 3 6 9 2 5 6 3
Selling Price ($000) 30 70 26 34 85 36 40 69

A regression analysis produced the following output:

Regression Statistics
Multiple R           0.867335
R Square             0.75227
Adjusted R Square    0.710982
Standard Error       12.023763
Observations         8

                     Coefficients   Standard Error   t Stat      P-value
Intercept            89.603448      10.472557        8.556023    0.000139
Age of car (years)   -7.781609      1.823038         -4.268483   0.005271

(i) State the independent and dependent variables.


(ii) State the regression equation from the output given. (Express answers to 3
decimal places.)

(iii) Determine the coefficient of correlation and interpret the value. (Express answer
to 3 decimal places).

Section B

Question 3

(a) A random sample of 120 small retail outlets showed that 75 of the firms use cashless
payments.

(i) Calculate a point estimate for the proportion of all small retail outlets that use
cashless payments.

(ii) Construct a 95% confidence interval to estimate the proportion of all small retail
outlets that use cashless payments.

(b) A property agent is interested to find out the prices of private apartments located at
Changi. He sampled 70 properties and found the mean and standard deviation to be
$940,000 and $133,000 respectively. Compute a 90% confidence interval for the true
mean price of private apartments located at Changi.

(c) For each of the following changes, state whether confidence interval for µ will become
wider or narrower:

(i) an increase in level of confidence

(ii) an increase in sample size

(iii) an increase in variability of the characteristic being measured.

(d) A potential entrepreneur is considering the purchase of a coin-operated laundry. The
present owner claims that over the past 3 years average daily revenue has been at least
$675 with a standard deviation of $75. A sample of 36 selected days reveals a daily
average revenue of $625. The potential entrepreneur thinks that the daily revenue could
be lower than what the present owner claims.

(i) Formulate the null and alternative hypotheses.

(ii) At the 1% significance level, is there evidence to conclude that the mean daily
revenue is less than what the owner claimed?

Question 4

(a) In recent years, there has been concern that in-patient hospital charges at government
hospitals differ widely. To find out whether the average hospital bills differ significantly,
data were collected from patients who stayed 5 days at four government hospitals for a
similar medical condition.

Groups Count Sum Average Variance


AA Hospital 7 43682 6240.285 845725.238
CC Hospital 6 30711 5118.5 2675969.9
SS Hospital 7 36650 5235.714 2326149.238
NN Hospital 6 43095 7182.5 176121.9

ANOVA
Source of Variation   SS             df   MS             F
Between Groups        17411832.3     3    5803944.099    ?
Within Groups         33291705.86    22   1513259.357
Total                 50703538.15    25

Examine the above Excel output which attempts to analyse whether the average bills
incurred at four hospitals are the same.

(i) State the null and alternative hypotheses.

(ii) At the 1% significance level, is there a significant difference in the mean hospital
bill for the four hospitals? Show all steps of the ANOVA test clearly.

(b) The average 5-day hospital bill at AA hospital is known to be normally distributed with
mean $6800 and a standard deviation of $1200.

(i) What is the probability a patient’s bill is more than $8500?

(ii) Suppose 10% of the patients’ bills are $X or less. Find the value of X.

(c) Cantone Group has three similar restaurants located at different parts of Singapore.
The restaurant management has commissioned a survey to determine customers’
satisfaction on the quality of food at each of the three restaurants. A random sample
of 100 customers was selected from each restaurant.
The results of the survey are shown in the following table:
                 Overall Satisfaction Rating
                 Excellent   Average   Below Average   TOTAL
Restaurant 1     59          32        9               100
Restaurant 2     48          44        8               100
Restaurant 3     64          26        10              100
TOTAL            171         102       27              300

At the 5% significance level, use a chi-square test to determine whether there is
evidence of a relationship between restaurant and overall satisfaction rating.

MOCK PRACTICE PAPER (ANSWERS)

Section A

Question 1

(a)(i) Discrete
(ii) Ordinal scale
(iii) Systematic sampling

(b)(i)

Class         Midpoint (M)   Freq (f)   fM      x̄       f(M - x̄)²
5 up to 20    12.5           3          37.5    37.25   1837.69
20 up to 35   27.5           7          192.5   37.25   665.44
35 up to 50   42.5           5          212.5   37.25   137.81
50 up to 65   57.5           4          230     37.25   1640.25
65 up to 80   72.5           1          72.5    37.25   1242.56
Total                        20         745             5523.75

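\[ \bar{x} = \frac{\sum fM}{n} = \frac{745}{20} = 37.25, \text{ i.e. } \$37{,}250 \]

\[ s = \sqrt{\frac{\sum f(M - \bar{x})^2}{n - 1}} = \sqrt{\frac{5523.75}{20 - 1}} = 17.051, \text{ i.e. } \$17{,}051 \]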
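The grouped-data mean and standard deviation above can be reproduced with a short script. This is a minimal Python sketch; the list `classes` and the other names are assumptions made for illustration, with the class limits and frequencies taken from the question.

```python
import math

# (lower limit, upper limit, frequency) for each budget class, in $000.
classes = [(5, 20, 3), (20, 35, 7), (35, 50, 5), (50, 65, 4), (65, 80, 1)]

n = sum(f for _, _, f in classes)
midpoints = [(lo + hi) / 2 for lo, hi, _ in classes]
freqs = [f for _, _, f in classes]

# Grouped mean: each class midpoint weighted by its frequency.
mean = sum(f * m for f, m in zip(freqs, midpoints)) / n

# Grouped sample standard deviation, dividing by n - 1.
sum_sq = sum(f * (m - mean) ** 2 for f, m in zip(freqs, midpoints))
std_dev = math.sqrt(sum_sq / (n - 1))

print(mean)               # 37.25
print(round(sum_sq, 2))   # 5523.75
print(round(std_dev, 3))  # 17.051
```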

(ii) (3+7+5)/20 X 100% = 75%

(iii)

[Histogram showing clients' budgets. x-axis: Budget ($000), class midpoints 12.5 to 72.5; y-axis: No. of clients]

(iv) The mean is a familiar concept and simple to understand and compute.
However, it is affected by extreme values. In such situations, the median may be
preferred.

(c)(i) P (Yes) = 125/500 = 0.25


(ii) P(No | Small car) = P(No ∩ Small car) / P(Small car) = (150/500) / (200/500) = 0.75

(iii) P(Yes ∩ Big car) = P(Yes) × P(Big car | Yes)
= 125/500 × 75/125 = 75/500 = 0.15

(iv) P(No ∪ Not Big car) = P(No) + P(Small car) - P(No ∩ Small car)
= 375/500 + 200/500 - 150/500 = 425/500 = 0.85
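The four probabilities can also be obtained by counting cells in the contingency table. The sketch below uses Python's `fractions.Fraction` for exact arithmetic; the helper `prob` and the dictionary layout are hypothetical choices made for this illustration.

```python
from fractions import Fraction

# Counts from the table, keyed by (car type, speeding violation).
counts = {
    ("small", "yes"): 50, ("small", "no"): 150,
    ("big", "yes"): 75,   ("big", "no"): 225,
}
total = sum(counts.values())

def prob(event):
    """Probability that a randomly selected driver satisfies `event`."""
    favourable = sum(c for key, c in counts.items() if event(*key))
    return Fraction(favourable, total)

p_yes = prob(lambda car, v: v == "yes")
p_no_given_small = (prob(lambda car, v: car == "small" and v == "no")
                    / prob(lambda car, v: car == "small"))
p_yes_and_big = prob(lambda car, v: v == "yes" and car == "big")
p_no_or_not_big = prob(lambda car, v: v == "no" or car != "big")

print(float(p_yes), float(p_no_given_small),
      float(p_yes_and_big), float(p_no_or_not_big))
# 0.25 0.75 0.15 0.85
```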

Question 2

(a) A scatter diagram provides an indication whether any relationship exists between the
two variables under study.

(b)(i)

(ii)
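\[ b = \frac{n\sum XY - \sum X \sum Y}{n\sum X^2 - (\sum X)^2} = \frac{8(1270) - (168)(56)}{8(4034) - (168)^2} = 0.18577 \]

\[ a = \bar{Y} - b\bar{X} = \frac{56}{8} - 0.18577\left(\frac{168}{8}\right) = 3.0988 \]

Regression equation: \[ \hat{Y} = 3.0988 + 0.1858X \]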

(iii) b = 0.1858
b is positive. This indicates a direct relationship between the number of training hours and
the number of units manufactured. For every one-hour increase in training time, the number
of units manufactured is expected to increase by 0.186 units.

(iv) Ŷ = 3.0988 + 0.1858X = 3.0988 + 0.1858(42) = 10.9 units

The prediction is outside the range of data that was collected (10 to 34 hours). This is an
extrapolation. The estimate may be inaccurate and should be used with caution.
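The slope, intercept and prediction can be verified from the raw data using the same least-squares formulas. This is a minimal Python sketch with illustrative variable names (assumptions of this example), not a replacement for the Excel regression output.

```python
# Training hours (x) and units manufactured per day (y) for the 8 workers.
hours = [34, 25, 10, 30, 20, 12, 15, 22]
units = [9, 8, 4, 9, 6, 5, 8, 7]

n = len(hours)
sum_x = sum(hours)
sum_y = sum(units)
sum_x2 = sum(x * x for x in hours)
sum_xy = sum(x * y for x, y in zip(hours, units))

# Least-squares slope and intercept.
b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
a = sum_y / n - b * sum_x / n

# Predicted units at 42 training hours (an extrapolation beyond 10 to 34 hours).
prediction = a + b * 42

print(sum_x, sum_y, sum_x2, sum_xy)                  # 168 56 4034 1270
print(round(b, 4), round(a, 4), round(prediction, 1))  # 0.1858 3.0988 10.9
```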

(c)(i) Independent variable (X) : Age of car (years)
Dependent variable (Y) : Price ($000)

(ii) Ŷ = 89.603 - 7.782X

(iii) r = - 0.867
The value shows a strong, negative (inverse) relationship between age of car and the
selling price.

Question 3

(a)(i) Sample proportion, p = 75/120 = 0.625
Point estimate of π = p = 0.625

(ii) 95% confidence interval:

\[ p \pm z\sqrt{\frac{p(1-p)}{n}} = 0.625 \pm 1.96\sqrt{\frac{0.625(1-0.625)}{120}} = (0.5384,\ 0.7116) \]

(b) n = 70, x̄ = 940000, s = 133000

90% confidence interval:

\[ \bar{x} \pm z\frac{s}{\sqrt{n}} = 940000 \pm 1.645\left(\frac{133000}{\sqrt{70}}\right) = 940000 \pm 26149.81 \]

= ($913,850.19; $966,149.81)
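Both intervals follow the same pattern: point estimate plus or minus z times the standard error. The sketch below illustrates this in Python; the helper `z_interval` is a hypothetical name, and the z values 1.96 and 1.645 are taken from the worked answers rather than computed.

```python
import math

def z_interval(estimate, std_error, z):
    """Return (lower, upper) limits: estimate +/- z * standard error."""
    margin = z * std_error
    return estimate - margin, estimate + margin

# (a)(ii) 95% interval for a proportion: 75 of 120 outlets use cashless payments.
p = 75 / 120
se_p = math.sqrt(p * (1 - p) / 120)
ci_p = z_interval(p, se_p, 1.96)

# (b) 90% interval for a mean: n = 70, sample mean 940000, s = 133000.
se_mean = 133000 / math.sqrt(70)
ci_mean = z_interval(940000, se_mean, 1.645)

print([round(v, 4) for v in ci_p])     # [0.5384, 0.7116]
print([round(v, 2) for v in ci_mean])  # [913850.19, 966149.81]
```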

(c)(i) wider
(ii) narrower
(iii) wider

(d)(i) Ho: µ ≥ 675


H1: µ < 675

(ii) α = 0.01, n = 36, x̄ = 625, s = 75

Test statistic:
\[ z = \frac{\bar{x} - \mu}{s/\sqrt{n}} = \frac{625 - 675}{75/\sqrt{36}} = -4.0 \]
Critical z value = -2.33
Decision rule: Reject H0 if test statistic < -2.33
Since the test statistic (-4.0) < -2.33, reject H0 and conclude that the average daily revenue
is less than what the owner claimed at α = 0.01.
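The left-tailed z test above reduces to one division and one comparison. A minimal Python sketch follows; the variable names are illustrative assumptions, and the critical value -2.33 is copied from the answer rather than derived.

```python
import math

mu_0 = 675    # hypothesised mean daily revenue ($) under H0
x_bar = 625   # sample mean daily revenue ($)
s = 75        # standard deviation ($) stated in the question
n = 36        # number of sampled days

# Test statistic: how many standard errors the sample mean lies from mu_0.
z = (x_bar - mu_0) / (s / math.sqrt(n))

Z_CRITICAL = -2.33  # left-tail critical value at the 1% significance level
reject_h0 = z < Z_CRITICAL

print(z, reject_h0)  # -4.0 True
```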

Question 4
(a)(i) H0 : µAA = µCC = µSS = µNN
H1: The means are not all equal

(ii) F statistic = MST/MSE = 5803944.099/1513259.357 = 3.835

Critical F0.01,3,22 = 4.82


Decision rule : Reject Ho if F statistic > 4.82

Since F statistic < 4.82, do not reject H0. There is insufficient evidence to conclude any
difference in the mean hospital bills for the 4 hospitals at α = 0.01.

(b)(i)
\[ P(x > 8500) = P\left(z > \frac{8500 - 6800}{1200}\right) = P(z > 1.42) = 0.5 - 0.4222 = 0.0778 \]

(ii) Given: µ = 6800, σ = 1200

\[ z = \frac{x - \mu}{\sigma} \]

\[ -1.28 = \frac{x - 6800}{1200} \]

x = $5,264
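These table-based answers can be compared with software values. The sketch below uses `statistics.NormalDist` from the Python standard library (available from Python 3.8). Because it does not round z to two decimal places, the results differ slightly from the table answers (approximately 0.078 against 0.0778, and about $5,262 against $5,264).

```python
from statistics import NormalDist

# 5-day hospital bill at AA Hospital: normal with mean $6800 and sd $1200.
bills = NormalDist(mu=6800, sigma=1200)

# (b)(i) Probability that a bill exceeds $8500 (upper-tail area).
p_over_8500 = 1 - bills.cdf(8500)

# (b)(ii) Bill amount below which 10% of bills fall (10th percentile).
x_10th = bills.inv_cdf(0.10)

print(round(p_over_8500, 3))
print(round(x_10th))
```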

(c) H0 : There is no relationship between restaurant and overall satisfaction rating.


H1 : There is a relationship between restaurant and overall satisfaction rating.
Observed frequencies (O)
        Excellent   Average   Below average   Total
R1      59          32        9               100
R2      48          44        8               100
R3      64          26        10              100
Total   171         102       27              300

Expected frequencies (E)
        Excellent   Average   Below average   Total
R1      57          34        9               100
R2      57          34        9               100
R3      57          34        9               100
Total   171         102       27              300

R1 = Restaurant 1, R2 = Restaurant 2, R3 = Restaurant 3
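\[ \chi^2 = \sum \left[ \frac{(O - E)^2}{E} \right] \]

\[
\begin{aligned}
\chi^2 &= \frac{(59-57)^2}{57} + \frac{(32-34)^2}{34} + \frac{(9-9)^2}{9} + \frac{(48-57)^2}{57} + \frac{(44-34)^2}{34} + \frac{(8-9)^2}{9} \\
&\quad + \frac{(64-57)^2}{57} + \frac{(26-34)^2}{34} + \frac{(10-9)^2}{9} \\
&= 7.514
\end{aligned}
\]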

df = (r-1) (c-1) = (3-1)(3-1) =4


Critical χ² = 9.488 (α = 0.05)
Decision rule: Reject H0 if χ² statistic > 9.488

Since χ² < 9.488, do not reject H0. There is insufficient evidence to conclude a
relationship between restaurant and overall satisfaction rating at α = 5%.
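The expected frequencies and the χ² statistic can be reproduced from the observed table alone, since each expected count is (row total × column total) / grand total. The following Python sketch shows this; the names are illustrative assumptions and the critical value 9.488 is copied from the answer above.

```python
# Observed counts: rows are Restaurants 1 to 3,
# columns are Excellent, Average, Below Average.
observed = [
    [59, 32, 9],
    [48, 44, 8],
    [64, 26, 10],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Expected count for each cell under independence.
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

# Chi-square statistic: sum of (O - E)^2 / E over all cells.
chi_sq = sum((o - e) ** 2 / e
             for o_row, e_row in zip(observed, expected)
             for o, e in zip(o_row, e_row))

df = (len(observed) - 1) * (len(observed[0]) - 1)
CHI_SQ_CRITICAL = 9.488  # chi-square table value for df = 4, 5% level
reject_h0 = chi_sq > CHI_SQ_CRITICAL

print(expected[0])               # [57.0, 34.0, 9.0]
print(round(chi_sq, 3), df)      # 7.514 4
print(reject_h0)                 # False
```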

ANSWERS TO DISCUSSION AND SUPPLEMENTARY QUESTIONS

Session 1: Discussion questions

1(a) Quantitative, continuous (b) Qualitative


(c) Quantitative, discrete (d) Qualitative

2. Employee Name: Nominal


Staff Code: Nominal
Gender: Nominal
Performance Rating: Ordinal
Salary: Ratio
Age Group: Ordinal
Years of experience: Ratio

3(a) Sample (b) Population

4(a) Descriptive (b) Inferential

5(a) Statistic (b) Parameter

6(a) Selection bias (b) Non-response bias


(c) Selection bias (d) Non-response bias
(e) Response bias

7(a) Simple random sampling (b) Systematic Sampling


(c) Stratified Sampling

Session 1: Supplementary questions

1(a) Inferential (b) Descriptive


(c) Descriptive (d) Inferential

2(a) Quantitative, continuous (b) Quantitative, discrete


(c) Qualitative (d) Quantitative, discrete

3(a) Ratio (b) Ratio


(c) Interval (d) Ordinal
(e) Ratio (f) Nominal

4(a) Sample (b) Population

5(a) No fixed answers (b) No fixed answers


(c) No fixed answers (d) No fixed answers

6(a) 200, 10.9 (b) 20, 10.4

7(a) Discrete (b) Ordinal


(c) Systematic Sampling

Session 2: Discussion questions

1(a) Qualitative (b) Frequency Table


(c)

(d)

2.

3(a)

(b)

4(a) Frequency distribution

(b)

(c) 7, $11.30, $10.20 (rough estimate)

5(a)

(b) 8 (c) 10.1, 10.2 10.4, 10.8


(d) 9.5 (e) 11.6, 7.7

6(a) 98 (b) 23.5%


(c)
[Frequency polygon. x-axis: Amount Spent ($); y-axis: No. of respondents]

Positively skewed.

7(a)

(b) 18%

(c)

Session 2: Supplementary questions

1(a) Qualitative (b) Frequency Table


(c)

(d)

2(a)

(b) 57 (c) 28.5%


(d) 22%

3(a) 4 (b) 6
(c)

4(a) (b)

5(a) 27 (b) 51.9%


(c) Histogram

6(a) 62 (b) 72
(c)

(d)

(e) 12 days

7(a)

(b) Easier to manage and analyse the data.


(c) 55%

(d)

(e) Pie chart. Qualitative variable with few categories.

8.

Positive relationship

9(a) Nominal (b) Contingency Table


(c) Women 78 %; Men 38%. Women more likely to order ice-cream.

10(a) Opinion: Nominal; Number of shares held: Ordinal


(b) Contingency Table
(c) The group with 100 up to 500 shares.

Session 3: Discussion questions


1(a) µ = ΣX/N (b) 83
(c) Parameter

2(a) 7 (b) 0
(c) No.

3(a) 450; 425; 200


(b) Median preferred. Not affected by extreme values.
(c) 800 (d) Variance or standard deviation.

4(a) $3380 (b) 39,490.80


(c) $198.72
(d) Larger average salary and larger spread of salaries.

5 2.33; 1.53

6(a) Frequency Distribution (b) $12.2m;


(c) $3.99m

7(a) $375 to $425 (b) $350 to $450


(c) $325 to $475

Session 3: Supplementary questions

1. 21.5; 22.5; 21

2(a) 52.17 kg (b) Right (positively) skewed.


(c) Median preferred.

3. 41 approximately

4(a) Daily sales (b) Grouped


(c) Left (negatively) skewed

5(a) 3; 3.7 (b) Median will change, variance will not.

6. Left (negatively) skewed. Mean < median.

7(a) $52,500; $35,000


(b) Median preferred. Not affected by outlier (150).
(c) $48,964

8(a) 32.55 kg (b) 126.73


(c) 11.26kg

9(a) Population (b) 8.25 years; 5.78 years

10(a) 35 (b) No. There is an extreme value (98)

Session 5: Discussion questions

1(a) 0.24 (b) 0.7

2(a) 0.2

(b)

3(a) 0.119 (b) 0.159

4. 6

5(a) 792 (b) 95,040

6. 0.565

7(a)

(b) 0.625 (c) 0.24


(d) 0.225

8(a) 0.767 (b) 0.533


(c) 0.1 (d) 0.75
(e) 0.643

9(a) 0.7 cars; 0.61; 0.781 cars (b) 0.5

Session 5: Supplementary questions

1. 0.75; 0.9375

2. 0.1

3(a) 0.169 (b) 0.914


(c) 0.203

4(a) 0.25
(b) 0.25

5(a) 0.1 (b) 0.5


(c) 0.0333

6. 0.5625

7(a) 0.185 (b) 0.075

8(a) 0.0158 (b) 0.716

9(a) 0.286 (b) 0.0286


(c) 0.686 (d) 0.314

10(a) 0.1833 (b) 0.455


(c) 0.483

11. 0.96; 0.04

12. 120

13. 10

14. 1320

15. 36

16(a) 2.0; 2.8; 1.67 (b) 0.8

Session 7: Discussion questions

1(a)
[Scatter diagram: Advertising vs Sales. x-axis: Advertising ($m); y-axis: Sales ($m)]

Strong, direct relationship.


(b) 2.2 (c) 1.5
(d) Ŷ = 1.5 + 2.2X
(e) x increases by 1 unit; y increases by 2.2 units
(f) 0.965 (g) 0.931
(h) $8.1 m

2(a) 0.4518 (b) -0.672


(c) Ŷ = 8.14 - 0.89X; slope = -0.89
(d) 3.69 hours (e) No. Extrapolation

3(a) Strong, direct relationship (b) Ŷ = -127955.052 + 9089.726X


(c) 0.916 (d) Operating hours, promotional expenses
(e) 0.957 (f) 9089.726
(g) -$7,143,825.75 (h) No. Extrapolation.
(i) Decrease by $1,817,945.20 (j) 894

Session 7: Supplementary questions

1(a) Ŷ = 25 + 7X
(b) Sales increase $7,000 for every additional sales call made.
(c) $67,000
(d) 0.996. Strong, direct relationship.

2(a)
[Scatter diagram: Income vs Food Expenditure. x-axis: Income ($00); y-axis: Food expenditure ($00)]

(b) Strong, direct relationship


(c) 0.959. Strong and positively correlated.
(d) Ŷ = 1.1422 + 0.2642X
(e) Food expenditure increases by $26.42 for every $100 income increase.
(f) $933.20 (g) No. Extrapolation.

3(a) Independent: Age of car (years) Dependent: Selling price ($000)


(b)

(c) -0.5435. Inverse relationship. Moderate strength


(d) 0.2954. 29.54% of variation in selling price explained by variation in age of car.
(e) Ŷ = 11.179 - 0.479X (f) $6,389
(g) Age of car increases by 1 year, selling price decreases by $479

4(a) Ŷ = 12.17 + 0.59X


(b) a = 12.17. Experience =0, number of units assembled = 12.17
b = 0.59. Number of units assembled increases by 0.59 for every additional year of
experience
(c) 19.25 units (d) 26.92 units. Extrapolation.
(e) 0.842. Strong, direct relationship

5(a) Independent : flight hours Dependent : Airfare ($)
(b) Ŷ = -20.423 + 131.401X (c) 0.818. Strong, positive relationship.
(d) 0.668 About 66.8% of variation in airfare is explained by the flight hours.
(e) $636.58
(f) The airfare will cost $262.80 more. (g) No. Extrapolation.

6(a) Independent: Years of driving experience Dependent: Number of demerit points


(b)

(c) -0.302. Number of demerit points decreases by 0.302 for every additional year of
driving experience.
(d) r = - 0.788 Strong, inverse relationship.
(e) 6 demerit points more. (f) X = 4 (whole number)

Session 8: Discussion questions

1(a) 0.1056 (b) 0.8276


(c) 0.2029

2(a) 0.2514 (b) 0.3450


(c) $374.80 (d) 0.0001. Not likely.

3(a) 0.1038 (b) 0.0981

4(a) 0.0475 (b) 0.9876

5(a) 0.0838 (b) 0.9599

Session 8: Supplementary questions

1(a) 0.2033 (b) 0.4595

2. 51 months

3(a) 46.95% (b) $5.72


(c) 0.0122

4(a) 34.46% (b) 60.06%


(c) 10.39% (d) $45,200

5. 0.0667

6(a) 0.0880 (b) 0.9924


(c) 0.0099

7(a) 0.0418 (b) 0.2373


(c) 0.0418

8(a) 0.0668 (b) 0.0170


(c) 0.13%

9. 83.4 marks

10(a) 0.3238 (b) 9.18%

Session 9: Discussion questions

1(a) 26 weeks (b) (24.28; 27.72) weeks


(c) Not reasonable. Value outside the confidence interval.

2(a) ($6950.50; $7,649.50)


(b) Interval wider as margin of error increases.

3. (0.641; 0.739)

4(a) n < 30 and σ unknown. Assume population normally distributed.


(b) 1.753 (c) (51.24; 68.77) litres

5. 107

6(a) (34.5; 46.7) years


(b)(i) (37.7; 43.5) years (b)(ii) (0.616; 0.884)

7(a) (8.25; 9.35) transactions (b) No, it falls outside the interval.
(c) 119

Session 9: Supplementary questions

1(a) (5.46; 6.33) days (b) (5.54; 6.26) days

2. (70.54; 71.86) kg
No, the value 68.7 falls outside the confidence interval.

3. (0.476; 0.702)

4. (33.45; 35.15) minutes

5(a) 3.847 kg
(b) (76.62; 86.18) kg Fertiliser makes a difference to the crop yield.

6(a) n < 30 and σ unknown. (b) (1.63; 1.97) times

7. (0.538; 0.762)

8(a) (141.02; 146.41) (b) (140.51; 146.93)


(c) (139.50; 147.94) (d) Yes. z-values are larger

9(a) (76.20; 81.60) (b) (77.09; 80.70)


(c) (77.28; 80.52) (d) Yes. Smaller standard errors.

10(a) 15.95 (b) (15.91; 15.99) ounces


(c) Increase sample size or Decrease the confidence level.

Session 10: Discussion questions

1(a) H0 : µ = 16    H1 : µ ≠ 16
(b) Test statistic: z = 0.8; Critical value: z = ± 1.96; do not reject H0.

2. H0 : µ ≤ 305    H1 : µ > 305
Test statistic: z = 3.50; Critical value: z = 1.645; reject H0.

3. H0 : µ ≥ 20.5    H1 : µ < 20.5
Test statistic: t = -2.8; Critical value: t = -2.602; reject H0.

4. H0 : π ≥ 0.35    H1 : π < 0.35
Test statistic: z = -1.26; Critical value: z = -1.645; do not reject H0.

5(a) H0 : µ ≥ 20    H1 : µ < 20
(b) Test statistic: t = -2.86; Critical value: t = -1.711; reject H0.
(c) Probability of Type I error = α = 0.05. This is the probability of rejecting a true H0.

Session 10: Supplementary questions

1. H0: µ ≤ 20    H1: µ > 20
Test statistic: z = 8.86; Critical value: z = 2.33; reject H0.

2. H0: µ ≥ 42.3    H1: µ < 42.3
Test statistic: t = -3.08; Critical value: t = -1.714; reject H0.

3(a) H0: µ ≥ 80    H1: µ < 80
Test statistic: z = -2.94; Critical value: z = -1.645; reject H0.
(b) Probability of Type I error = α = 0.05. This is the probability of rejecting H0 when it
is true.

4(a) H0: µ = 190    H1: µ ≠ 190
(b) Test statistic: t = -2.53; Critical value: t = ±2.262; reject H0.

5(a) sample mean = 5.64; sample standard deviation = 0.635
(b) H0: µ ≥ 6    H1: µ < 6
Test statistic: t = -1.604; Critical value: t = -2.998; do not reject H0.

6(a) H0: µ ≥ 3    H1: µ < 3
Test statistic: z = -1.77; Critical value: z = -1.645; reject H0.
(b) p-value = 0.038, which is < α = 0.05; reject H0.

7(a) H0: µ ≥ 20    H1: µ < 20
(b) Reject H0 if test statistic < −1.645
(c) Test statistic: z = -3.68
(d) Reject H0.
(e) The mean amount is less than 20 ounces, at the 0.05 level of significance.
(f) p-value = 0.0001, which is < α = 0.05; reject H0.

8(a) H0: µ ≥ 3425    H1: µ < 3425
(b) Test statistic: z = -3.95; Critical value: z = -2.05; reject H0.

9(a) H0: π ≥ 0.99    H1: π < 0.99
(b) Test statistic: z = -2.01; Critical value: z = -1.28; reject H0.

10. H0: π ≤ 0.04    H1: π > 0.04
Test statistic: z = 2.16; Critical value: z = 1.645; reject H0.

Session 11: Discussion questions

1. Test statistic: χ² = 17.251; Critical value: χ² = 13.277; reject H0.

2(a) Test statistic: χ² = 14.4; Critical value: χ² = 5.991; reject H0.
(b) No change; reject H0. The critical value of χ² will be smaller than 5.991.

3(a) Test statistic: χ² = 4.514; Critical value: χ² = 3.841; reject H0.
(b) Only a large value of χ² is likely to lead to rejection of H0, since a large value indicates
that the observed and expected frequencies differ significantly.

Session 11: Supplementary questions

1. Test statistic: χ² = 0.029; Critical value: χ² = 3.841; do not reject H0.

2. Test statistic: χ² = 32.24; Critical value: χ² = 9.210; reject H0.

3. Test statistic: χ² = 7.963; Critical value: χ² = 11.345; do not reject H0.

4. Test statistic: χ² = 3.048; Critical value: χ² = 6.635; do not reject H0.

Session 12: Discussion questions

1. Test statistic: F = 13.52; Critical value: F = 3.89; reject H0.

2. Test statistic: F = 1.91; Critical value: F = 5.78; do not reject H0.

3(a) H0: µ_China = µ_Japan = µ_India = µ_Korea
H1: The means are not all equal
(b) Critical value: F = 3.01
(c) Test statistic: F = 0.89; Critical value: F = 3.01; do not reject H0.
(d) No. Both variables are qualitative.

4(a) H0: µ_3-room = µ_4-room = µ_5-room
H1: The means are not all equal
(b) A: SST = 4.2239; B: MSE = 1.763; C: F statistic = 1.198
(c) Test statistic: F = 1.198; Critical value: F = 3.59; do not reject H0.

Session 12: Supplementary questions

1(a)

(b) Test statistic: F = 4.07; Critical value: F = 3.29; reject Ho.

2(a) H0: µ1 = µ2 = µ3
H1: Not all means are equal.
(b) Reject H0 if F statistic > 6.93.
(c) SS Total = 120.9333; SSE = 107.2; SST = 13.7333
(d)

(e) Test statistic: F = 0.769; Critical value: F = 6.93; do not reject H0.

3(a) H0: µ1 = µ2 = µ3 = µ4 = µ5
H1: The mean sales are not all equal.
(b) Reject H0 if F statistic > 5.99.
(c) Test statistic: F = 22.4423; Critical value: F = 5.99; reject H0.

4(a) Programme M. Average sales = 67 units


(b) H0: µ_K = µ_L = µ_M
H1: The means are not all equal
(c) Test statistic: F = 2.4024; Critical value: F = 4.30; do not reject H0.

5(a) k = 3; n = 20
(b) H0: µ_ladies = µ_revolution = µ_platinum
H1: The means are not all equal
(c) F statistic = 0.359
(d) Test statistic: F = 0.359; Critical value: F = 6.11; do not reject H0.

Appendix 1_1
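Key Formulas

Population Mean                             $\mu = \dfrac{\sum x}{N}$

Sample Mean                                 $\bar{x} = \dfrac{\sum x}{n}$

Range                                       Largest value – Smallest value

Population variance                         $\sigma^2 = \dfrac{\sum (x - \mu)^2}{N}$

Population standard deviation               $\sigma = \sqrt{\dfrac{\sum (x - \mu)^2}{N}}$

Sample variance                             $s^2 = \dfrac{\sum (x - \bar{x})^2}{n - 1}$

Sample standard deviation                   $s = \sqrt{\dfrac{\sum (x - \bar{x})^2}{n - 1}}$

Sample mean, grouped data                   $\bar{x} = \dfrac{\sum fM}{n}$

Sample standard deviation, grouped data     $s = \sqrt{\dfrac{\sum f(M - \bar{x})^2}{n - 1}}$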
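The descriptive formulas above can be sketched in Python as follows. This is only an illustrative sketch: the function names and the data values are made up for the example, and the grouped-data function assumes that each class is represented by its midpoint and frequency.

```python
import math

def sample_mean(values):
    # x-bar = (sum of x) / n
    return sum(values) / len(values)

def population_variance(values):
    # sigma^2 = sum of (x - mu)^2, divided by N
    mu = sum(values) / len(values)
    return sum((x - mu) ** 2 for x in values) / len(values)

def sample_variance(values):
    # s^2 = sum of (x - x-bar)^2, divided by n - 1
    xbar = sample_mean(values)
    return sum((x - xbar) ** 2 for x in values) / (len(values) - 1)

def grouped_mean_sd(midpoints, freqs):
    # Grouped data: weight each class midpoint by its frequency.
    n = sum(freqs)
    xbar = sum(f * m for f, m in zip(freqs, midpoints)) / n
    var = sum(f * (m - xbar) ** 2 for f, m in zip(freqs, midpoints)) / (n - 1)
    return xbar, math.sqrt(var)

# Hypothetical data, for illustration only.
data = [2, 4, 4, 4, 5, 5, 7, 9]
data_range = max(data) - min(data)
pop_sd = math.sqrt(population_variance(data))
samp_sd = math.sqrt(sample_variance(data))
g_mean, g_sd = grouped_mean_sd([10, 20, 30], [2, 5, 3])
print(sample_mean(data), data_range, pop_sd, round(samp_sd, 4))
print(g_mean, round(g_sd, 4))
```

The only difference between the population and sample versions is the divisor (N versus n - 1), which is why the sample standard deviation comes out slightly larger for the same data.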

Special rule of addition                    P(A or B) = P(A) + P(B)

General rule of addition                    P(A or B) = P(A) + P(B) - P(A and B)

Special rule of multiplication              P(A and B) = P(A) × P(B)

General rule of multiplication              P(A and B) = P(A) × P(B|A)

Complement rule                             P(A) = 1 – P(~A)

Bayes’ Theorem                              $P(A_1 \mid B) = \dfrac{P(A_1)\,P(B \mid A_1)}{P(A_1)\,P(B \mid A_1) + P(A_2)\,P(B \mid A_2)}$

Multiplication formula                      Total arrangements = (m)(n)

Number of permutations                      ${}_nP_r = \dfrac{n!}{(n - r)!}$

Number of combinations                      ${}_nC_r = \dfrac{n!}{r!\,(n - r)!}$
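As a rough illustration of Bayes' Theorem and the counting formulas, the sketch below uses Python's standard `math.perm` and `math.comb` functions. The helper name `bayes_two_events` and all probabilities are hypothetical, and the helper assumes that A1 and A2 are mutually exclusive and together cover all outcomes, so that P(A2) = 1 - P(A1).

```python
import math

def bayes_two_events(p_a1, p_b_given_a1, p_b_given_a2):
    # Assumes A2 is the complement of A1.
    p_a2 = 1 - p_a1
    numerator = p_a1 * p_b_given_a1
    return numerator / (numerator + p_a2 * p_b_given_a2)

# Hypothetical inputs: P(A1) = 0.05, P(B|A1) = 0.90, P(B|A2) = 0.15.
posterior = bayes_two_events(0.05, 0.90, 0.15)
print(round(posterior, 4))

# Arrangements of r = 2 items chosen from n = 5.
permutations = math.perm(5, 2)   # n! / (n - r)!
combinations = math.comb(5, 2)   # n! / (r! (n - r)!)
print(permutations, combinations)
```

With these made-up inputs the posterior probability is 0.045 / 0.1875 = 0.24, and the permutation count (20) is r! = 2 times the combination count (10) because order matters for permutations.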

Appendix 1_2

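Mean of a probability distribution          $\mu = \sum [x \cdot P(x)]$

Variance of a probability distribution      $\sigma^2 = \sum [(x - \mu)^2 \cdot P(x)]$

Standard normal value                       $z = \dfrac{x - \mu}{\sigma}$

                                            $z = \dfrac{\bar{x} - \mu}{\sigma / \sqrt{n}}$

Standard error of mean                      $\sigma_{\bar{x}} = \dfrac{\sigma}{\sqrt{n}}$

Confidence interval for µ                   $\bar{x} \pm z \dfrac{\sigma}{\sqrt{n}}$

                                            $\bar{x} \pm z \dfrac{s}{\sqrt{n}}$

                                            $\bar{x} \pm t \dfrac{s}{\sqrt{n}}$ with df = n - 1

Sample proportion                           $p = \dfrac{x}{n}$

Confidence interval for proportion          $p \pm z \sqrt{\dfrac{p(1 - p)}{n}}$

Sample size for estimating mean             $n = \left[ \dfrac{z\sigma}{E} \right]^2$

Sample size for proportion                  $n = p(1 - p) \left[ \dfrac{z}{E} \right]^2$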
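The interval and sample-size formulas above might be coded along the following lines. This is a sketch under stated assumptions: the function names and input numbers are invented for illustration, the z value is obtained from `statistics.NormalDist` (Python 3.8 or later) rather than from the printed table, and required sample sizes are rounded up to the next whole number.

```python
import math
from statistics import NormalDist

def z_for_confidence(level):
    # Two-sided z value, e.g. level = 0.95 gives roughly 1.96.
    return NormalDist().inv_cdf(1 - (1 - level) / 2)

def mean_ci(xbar, sd, n, multiplier):
    # multiplier is a z value, or a t value read from the t table.
    margin = multiplier * sd / math.sqrt(n)
    return xbar - margin, xbar + margin

def proportion_ci(p, n, z):
    margin = z * math.sqrt(p * (1 - p) / n)
    return p - margin, p + margin

def sample_size_mean(z, sigma, max_error):
    # Assumption: round up so the margin of error is not exceeded.
    return math.ceil((z * sigma / max_error) ** 2)

def sample_size_proportion(p, z, max_error):
    return math.ceil(p * (1 - p) * (z / max_error) ** 2)

# Hypothetical inputs, for illustration only.
z95 = z_for_confidence(0.95)
low, high = mean_ci(xbar=50, sd=10, n=100, multiplier=1.96)
p_low, p_high = proportion_ci(p=0.4, n=600, z=1.96)
n_mean = sample_size_mean(z=1.96, sigma=10, max_error=2)
n_prop = sample_size_proportion(p=0.5, z=1.96, max_error=0.05)
print(round(z95, 3), (round(low, 2), round(high, 2)), n_mean, n_prop)
```

Passing the multiplier explicitly lets the same function serve both the z-based and the t-based interval; for the t interval the value would be read from Appendix 3 with df = n - 1.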

Testing of hypothesis, one mean             $z = \dfrac{\bar{x} - \mu}{\sigma / \sqrt{n}}$

                                            $z = \dfrac{\bar{x} - \mu}{s / \sqrt{n}}$

                                            $t = \dfrac{\bar{x} - \mu}{s / \sqrt{n}}$
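To make the one-mean test concrete, the sketch below computes the z statistic and a left-tailed decision. The function names and the sample figures are hypothetical, and the critical value and p-value come from `statistics.NormalDist` instead of Appendix 2; a t-based test would need the critical value from Appendix 3.

```python
import math
from statistics import NormalDist

def one_sample_z(xbar, mu0, sd, n):
    # z = (x-bar - mu) / (sd / sqrt(n))
    return (xbar - mu0) / (sd / math.sqrt(n))

def left_tail_decision(z, alpha):
    # For H1: mu < mu0, reject H0 when z falls below the critical value.
    critical = NormalDist().inv_cdf(alpha)
    p_value = NormalDist().cdf(z)
    return critical, p_value, z < critical

# Hypothetical example: H0: mu >= 20, H1: mu < 20, alpha = 0.05.
z_stat = one_sample_z(xbar=19.6, mu0=20, sd=1.2, n=36)
critical, p_value, reject = left_tail_decision(z_stat, alpha=0.05)
print(round(z_stat, 2), round(critical, 3), round(p_value, 4), reject)
```

The critical-value rule and the p-value rule always agree: z is below the critical value exactly when the p-value is below α.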

Appendix 1_3

Test of hypothesis, one proportion          $z = \dfrac{p - \pi}{\sqrt{\dfrac{\pi(1 - \pi)}{n}}}$

Sum of Squares, Total                       $\text{SS Total} = \sum (X - \bar{X}_G)^2$

Sum of Squares, Error                       $\text{SSE} = \sum (X - \bar{X}_C)^2$

Sum of Squares, Treatment                   SST = SS Total - SSE

F statistic                                 $F = \dfrac{\text{SST} / (k - 1)}{\text{SSE} / (n - k)}$

                                            df (numerator) = k - 1

                                            df (denominator) = n - k
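The ANOVA quantities above can be traced step by step in a short Python sketch. The function name `one_way_anova` and the three small groups are invented for illustration; the critical F value would still be read from Appendix 5.

```python
def one_way_anova(groups):
    all_values = [x for g in groups for x in g]
    n = len(all_values)          # total number of observations
    k = len(groups)              # number of treatments
    grand_mean = sum(all_values) / n
    # SS Total: squared deviations from the grand mean.
    ss_total = sum((x - grand_mean) ** 2 for x in all_values)
    # SSE: squared deviations from each observation's own treatment mean.
    sse = sum((x - sum(g) / len(g)) ** 2 for g in groups for x in g)
    sst = ss_total - sse
    f_stat = (sst / (k - 1)) / (sse / (n - k))
    return ss_total, sse, sst, f_stat, (k - 1, n - k)

# Hypothetical data: three treatments with three observations each.
ss_total, sse, sst, f_stat, df = one_way_anova([[1, 2, 3], [2, 3, 4], [5, 6, 7]])
print(ss_total, sse, sst, f_stat, df)
```

For this made-up data the F statistic is 13 with (2, 6) degrees of freedom, which would be compared against the Appendix 5 value in the column for 2 and the row for 6.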

Chi-Square statistic                        $\chi^2 = \sum \left[ \dfrac{(O - E)^2}{E} \right]$

Expected frequency                          $E = \dfrac{\text{Row Total} \times \text{Column Total}}{n}$

                                            df = (r - 1)(c - 1)
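A minimal sketch of the chi-square test of independence, assuming the observed frequencies are supplied as a list of rows; the function name and the 2 × 2 table are hypothetical.

```python
def chi_square_independence(observed):
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            # Expected frequency = row total x column total / n
            e = row_totals[i] * col_totals[j] / n
            chi2 += (o - e) ** 2 / e
    df = (len(observed) - 1) * (len(observed[0]) - 1)
    return chi2, df

# Hypothetical 2 x 2 contingency table.
chi2, df = chi_square_independence([[30, 20], [20, 30]])
print(chi2, df)
```

Here every expected frequency is 25, so the statistic is 4 × (5² / 25) = 4.0 with 1 degree of freedom; at the 0.05 level this exceeds the Appendix 4 critical value of 3.841, so H0 would be rejected.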

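Linear Regression & Correlation             $\bar{X} = \dfrac{\sum X}{n}$, $\bar{Y} = \dfrac{\sum Y}{n}$

Correlation coefficient                     $r = \dfrac{n \sum XY - \sum X \sum Y}{\sqrt{[n \sum X^2 - (\sum X)^2][n \sum Y^2 - (\sum Y)^2]}}$

Coefficient of determination                $r^2 = \dfrac{a \sum Y + b \sum XY - n\bar{Y}^2}{\sum Y^2 - n\bar{Y}^2}$

Linear Regression Equation                  $\hat{Y} = a + bX$

The estimated slope coefficient             $b = \dfrac{n \sum XY - \sum X \sum Y}{n \sum X^2 - (\sum X)^2}$

The estimated intercept coefficient         $a = \bar{Y} - b\bar{X}$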
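The regression and correlation formulas can be checked numerically with a sketch like the one below. The function name and the five data pairs are made up; the code computes the coefficient of determination from the formula above and it should agree with the square of r.

```python
import math

def linear_regression(xs, ys):
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    syy = sum(y * y for y in ys)
    ybar = sy / n
    b = (n * sxy - sx * sy) / (n * sxx - sx ** 2)        # slope
    a = ybar - b * (sx / n)                              # intercept
    r = (n * sxy - sx * sy) / math.sqrt(
        (n * sxx - sx ** 2) * (n * syy - sy ** 2))       # correlation
    r2 = (a * sy + b * sxy - n * ybar ** 2) / (syy - n * ybar ** 2)
    return a, b, r, r2

# Hypothetical data, for illustration only.
a, b, r, r2 = linear_regression([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
print(round(a, 4), round(b, 4), round(r, 4), round(r2, 4))
```

For these values the fitted line is Ŷ = 2.2 + 0.6X, with r about 0.7746 and r² = 0.6.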

Appendix 2
AREA UNDER THE NORMAL CURVE
[Figure: standard normal curve with an area of 0.4750 shaded between the mean (0) and z = 1.96]

___________________________________________________________________________

Example: To find the area under the curve between the mean and a point 1.96 standard deviations to
the right of the mean, look up the value opposite 1.9 and under 0.06 in the table; 0.4750 of the area
under the curve lies between the mean and a z value of 1.96.
___________________________________________________________________________

z 0.00 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09

0.0 0.0000 0.0040 0.0080 0.0120 0.0160 0.0199 0.0239 0.0279 0.0319 0.0359
0.1 0.0398 0.0438 0.0478 0.0517 0.0557 0.0596 0.0636 0.0675 0.0714 0.0753
0.2 0.0793 0.0832 0.0871 0.0910 0.0948 0.0987 0.1026 0.1064 0.1103 0.1141
0.3 0.1179 0.1217 0.1255 0.1293 0.1331 0.1368 0.1406 0.1443 0.1480 0.1517
0.4 0.1554 0.1591 0.1628 0.1664 0.1700 0.1736 0.1772 0.1808 0.1844 0.1879
0.5 0.1915 0.1950 0.1985 0.2019 0.2054 0.2088 0.2123 0.2157 0.2190 0.2224
0.6 0.2257 0.2291 0.2324 0.2357 0.2389 0.2422 0.2454 0.2486 0.2517 0.2549
0.7 0.2580 0.2611 0.2642 0.2673 0.2704 0.2734 0.2764 0.2794 0.2823 0.2852
0.8 0.2881 0.2910 0.2939 0.2967 0.2995 0.3023 0.3051 0.3078 0.3106 0.3133
0.9 0.3159 0.3186 0.3212 0.3238 0.3264 0.3289 0.3315 0.3340 0.3365 0.3389
1.0 0.3413 0.3438 0.3461 0.3485 0.3508 0.3531 0.3554 0.3577 0.3599 0.3621
1.1 0.3643 0.3665 0.3686 0.3708 0.3729 0.3749 0.3770 0.3790 0.3810 0.3830
1.2 0.3849 0.3869 0.3888 0.3907 0.3925 0.3944 0.3962 0.3980 0.3997 0.4015
1.3 0.4032 0.4049 0.4066 0.4082 0.4099 0.4115 0.4131 0.4147 0.4162 0.4177
1.4 0.4192 0.4207 0.4222 0.4236 0.4251 0.4265 0.4279 0.4292 0.4306 0.4319
1.5 0.4332 0.4345 0.4357 0.4370 0.4382 0.4394 0.4406 0.4418 0.4429 0.4441
1.6 0.4452 0.4463 0.4474 0.4484 0.4495 * 0.4505 0.4515 0.4525 0.4535 0.4545
1.7 0.4554 0.4564 0.4573 0.4582 0.4591 0.4599 0.4608 0.4616 0.4625 0.4633
1.8 0.4641 0.4649 0.4656 0.4664 0.4671 0.4678 0.4686 0.4693 0.4699 0.4706
1.9 0.4713 0.4719 0.4726 0.4732 0.4738 0.4744 0.4750 0.4756 0.4761 0.4767
2.0 0.4772 0.4778 0.4783 0.4788 0.4793 0.4798 0.4803 0.4808 0.4812 0.4817
2.1 0.4821 0.4826 0.4830 0.4834 0.4838 0.4842 0.4846 0.4850 0.4854 0.4857
2.2 0.4861 0.4864 0.4868 0.4871 0.4875 0.4878 0.4881 0.4884 0.4887 0.4890
2.3 0.4893 0.4896 0.4898 0.4901 0.4904 0.4906 0.4909 0.4911 0.4913 0.4916
2.4 0.4918 0.4920 0.4922 0.4925 0.4927 0.4929 0.4931 0.4932 0.4934 0.4936
2.5 0.4938 0.4940 0.4941 0.4943 0.4945 0.4946 0.4948 0.4949 * 0.4951 0.4952
2.6 0.4953 0.4955 0.4956 0.4957 0.4959 0.4960 0.4961 0.4962 0.4963 0.4964
2.7 0.4965 0.4966 0.4967 0.4968 0.4969 0.4970 0.4971 0.4972 0.4973 0.4974
2.8 0.4974 0.4975 0.4976 0.4977 0.4977 0.4978 0.4979 0.4979 0.4980 0.4981
2.9 0.4981 0.4982 0.4982 0.4983 0.4984 0.4984 0.4985 0.4985 0.4986 0.4986
3.0 0.4987 0.4987 0.4987 0.4988 0.4988 0.4989 0.4989 0.4989 0.4990 0.4990

NOTE: For values of z above 3.09, use 0.4999 for the area.
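Table entries can be cross-checked with Python's `statistics.NormalDist` (available from Python 3.8). The sketch below is illustrative only; the helper name `area_mean_to_z` is not from the module. It returns the area between the mean and z, which is the quantity tabulated above.

```python
from statistics import NormalDist

def area_mean_to_z(z):
    # Area between the mean (z = 0) and the given z value.
    return NormalDist().cdf(abs(z)) - 0.5

area_196 = area_mean_to_z(1.96)      # table entry: row 1.9, column 0.06
area_100 = area_mean_to_z(1.00)      # table entry: row 1.0, column 0.00
right_tail = 0.5 - area_196          # P(Z > 1.96)
print(round(area_196, 4), round(area_100, 4), round(right_tail, 4))
```

Subtracting a table area from 0.5 gives the tail area beyond z, which is how the table is used for one-tailed probabilities.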

Appendix 3
t-DISTRIBUTION
[Figure: three t-distribution curves. Left tail: critical t-score (negative). Right tail: critical t-score (positive). Two tails: critical t-scores (negative) and (positive).]

Degrees of    .005 (one tail)    .01 (one tail)     .025 (one tail)    .05 (one tail)
freedom       .01 (two tails)    .02 (two tails)    .05 (two tails)    .10 (two tails)
1 63.657 31.821 12.706 6.314
2 9.925 6.965 4.303 2.920
3 5.841 4.541 3.182 2.353
4 4.604 3.747 2.776 2.132
5 4.032 3.365 2.571 2.015
6 3.707 3.143 2.447 1.943
7 3.500 2.998 2.365 1.895
8 3.355 2.896 2.306 1.860
9 3.250 2.821 2.262 1.833
10 3.169 2.764 2.228 1.812
11 3.106 2.718 2.201 1.796
12 3.054 2.681 2.179 1.782
13 3.012 2.650 2.160 1.771
14 2.977 2.625 2.145 1.761
15 2.947 2.602 2.132 1.753
16 2.921 2.584 2.120 1.746
17 2.898 2.567 2.110 1.740
18 2.878 2.552 2.101 1.734
19 2.861 2.540 2.093 1.729
20 2.845 2.528 2.086 1.725
21 2.831 2.518 2.080 1.721
22 2.819 2.508 2.074 1.717
23 2.807 2.500 2.069 1.714
24 2.797 2.492 2.064 1.711
25 2.787 2.485 2.060 1.708
26 2.779 2.479 2.056 1.706
27 2.771 2.473 2.052 1.703
28 2.763 2.467 2.048 1.701
29 2.756 2.462 2.045 1.699
Large (Z) 2.575 2.327 1.960 1.645

Appendix 4

Area in the Right Tail of a Chi-square Distribution

EXAMPLE: In a chi-square distribution with 11 degrees of freedom, to find the appropriate
chi-square value for 0.05 of the area under the curve (the area in the right tail), we look
under the 0.05 column in the table and proceed down to the 11 degrees of freedom row; the
appropriate chi-square value there is 19.675.

Area in right tail


Degrees of freedom .05 .01
1 3.841 6.635
2 5.991 9.210
3 7.815 11.345
4 9.488 13.277
5 11.070 15.086
6 12.592 16.812
7 14.067 18.475
8 15.507 20.090
9 16.919 21.666
10 18.307 23.209
11 19.675 24.725
12 21.026 26.217
13 22.362 27.688
14 23.685 29.141
15 24.996 30.578
16 26.296 32.000
17 27.587 33.409
18 28.869 34.805
19 30.144 36.191
20 31.410 37.566
21 32.671 38.932
22 33.924 40.289
23 35.172 41.638
24 36.415 42.980
25 37.652 44.314
26 38.885 45.642
27 40.113 46.963
28 41.337 48.278
29 42.557 49.588
30 43.773 50.892

Appendix 5_1
Values of F for F Distributions with .05 of the Area in the Right Tail

EXAMPLE: For a test at a significance level of .05 where we have 15 degrees of freedom for the numerator and 6 degrees of freedom
for the denominator, the appropriate F value is found by looking under the 15 degrees of freedom column and proceeding down
to the 6 degrees of freedom row; there we find the appropriate F value to be 3.94.

Degrees of freedom for numerator (columns)
Degrees of freedom for denominator (rows)

1 2 3 4 5 6 7 8 9 10 12 15 20 24 30 40 60 120 ∞
1 161 200 216 225 230 234 237 239 241 242 244 246 248 249 250 251 252 253 254
2 18.5 19.0 19.2 19.3 19.3 19.4 19.4 19.4 19.4 19.4 19.4 19.4 19.4 19.5 19.5 19.5 19.5 19.5 19.5
3 10.1 9.55 9.28 9.12 9.01 8.94 8.89 8.85 8.81 8.79 8.74 8.70 8.66 8.64 8.62 8.59 8.57 8.55 8.53
4 7.71 6.94 6.59 6.39 6.26 6.16 6.09 6.04 6.00 5.96 5.91 5.86 5.80 5.77 5.75 5.72 5.69 5.66 5.63
5 6.61 5.79 5.41 5.19 5.05 4.95 4.88 4.82 4.77 4.74 4.68 4.62 4.56 4.53 4.50 4.46 4.43 4.40 4.37
6 5.99 5.14 4.76 4.53 4.39 4.28 4.21 4.15 4.10 4.06 4.00 3.94 3.87 3.84 3.81 3.77 3.74 3.70 3.67
7 5.59 4.74 4.35 4.12 3.97 3.87 3.79 3.73 3.68 3.64 3.57 3.51 3.44 3.41 3.38 3.34 3.30 3.27 3.23
8 5.32 4.46 4.07 3.84 3.69 3.58 3.50 3.44 3.39 3.35 3.28 3.22 3.15 3.12 3.08 3.04 3.01 2.97 2.93
9 5.12 4.30 3.86 3.63 3.48 3.37 3.29 3.23 3.18 3.14 3.07 3.01 2.94 2.90 2.86 2.83 2.79 2.75 2.71
10 4.96 4.10 3.71 3.48 3.33 3.22 3.14 3.07 3.02 2.98 2.91 2.85 2.77 2.74 2.70 2.66 2.62 2.58 2.54
11 4.84 3.98 3.59 3.36 3.20 3.09 3.01 2.95 2.90 2.85 2.79 2.72 2.65 2.61 2.57 2.53 2.49 2.45 2.40
12 4.75 3.89 3.49 3.26 3.11 3.00 2.91 2.85 2.80 2.75 2.69 2.62 2.54 2.51 2.47 2.43 2.38 2.34 2.30
13 4.67 3.81 3.41 3.18 3.03 2.92 2.83 2.77 2.71 2.67 2.60 2.53 2.46 2.42 2.38 2.34 2.30 2.25 2.21
14 4.60 3.74 3.34 3.11 2.96 2.85 2.76 2.70 2.65 2.60 2.53 2.46 2.39 2.35 2.31 2.27 2.22 2.18 2.13
15 4.54 3.68 3.29 3.06 2.90 2.79 2.71 2.64 2.59 2.54 2.48 2.40 2.33 2.29 2.25 2.20 2.16 2.11 2.07
16 4.49 3.63 3.24 3.01 2.85 2.74 2.66 2.59 2.54 2.49 2.42 2.35 2.28 2.24 2.19 2.15 2.11 2.06 2.01
17 4.45 3.59 3.20 2.96 2.81 2.70 2.61 2.55 2.49 2.45 2.38 2.31 2.23 2.19 2.15 2.10 2.06 2.01 1.96
18 4.41 3.55 3.16 2.93 2.77 2.66 2.58 2.51 2.46 2.41 2.34 2.27 2.19 2.15 2.11 2.06 2.02 1.97 1.92
19 4.38 3.52 3.13 2.90 2.74 2.63 2.54 2.48 2.42 2.38 2.31 2.23 2.16 2.11 2.07 2.03 1.98 1.93 1.88
20 4.35 3.49 3.10 2.87 2.71 2.60 2.51 2.45 2.39 2.35 2.28 2.20 2.12 2.08 2.04 1.99 1.95 1.90 1.84
21 4.32 3.47 3.07 2.84 2.68 2.57 2.49 2.42 2.37 2.32 2.25 2.18 2.10 2.05 2.01 1.96 1.92 1.87 1.81
22 4.30 3.44 3.05 2.82 2.66 2.55 2.46 2.40 2.34 2.30 2.23 2.15 2.07 2.03 1.98 1.94 1.89 1.84 1.78
23 4.28 3.42 3.03 2.80 2.64 2.53 2.44 2.37 2.32 2.27 2.20 2.13 2.05 2.01 1.96 1.91 1.86 1.81 1.76
24 4.26 3.40 3.01 2.78 2.62 2.51 2.42 2.36 2.30 2.25 2.18 2.11 2.03 1.98 1.94 1.89 1.84 1.79 1.73
25 4.24 3.39 2.99 2.76 2.60 2.49 2.40 2.34 2.28 2.24 2.16 2.09 2.01 1.96 1.92 1.87 1.82 1.77 1.71
30 4.17 3.32 2.92 2.69 2.53 2.42 2.33 2.27 2.21 2.16 2.09 2.01 1.93 1.89 1.84 1.79 1.74 1.68 1.62
40 4.08 3.23 2.84 2.61 2.45 2.34 2.25 2.18 2.12 2.08 2.00 1.92 1.84 1.79 1.74 1.69 1.64 1.58 1.51
60 4.00 3.15 2.76 2.53 2.37 2.25 2.17 2.10 2.04 1.99 1.92 1.84 1.75 1.70 1.65 1.59 1.53 1.47 1.39
120 3.92 3.07 2.68 2.45 2.29 2.18 2.09 2.02 1.96 1.91 1.83 1.75 1.66 1.61 1.55 1.50 1.43 1.35 1.25
∞ 3.84 3.00 2.60 2.37 2.21 2.10 2.01 1.94 1.88 1.83 1.75 1.67 1.57 1.52 1.46 1.39 1.32 1.22 1.00

Appendix 5_2
Values of F for F Distributions with .01 of the Area in the Right Tail

EXAMPLE: For a test at a significance level of .01 where we have 7 degrees of freedom for the numerator and 5 degrees of freedom
for the denominator, the appropriate F value is found by looking under the 7 degrees of freedom column and proceeding down to
the 5 degrees of freedom row; there we find the appropriate F value to be 10.5.

Degrees of freedom for numerator (columns)
Degrees of freedom for denominator (rows)

1 2 3 4 5 6 7 8 9 10 12 15 20 24 30 40 60 120
1 4,052 5,000 5,403 5,625 5,764 5,859 5,928 5,982 6,023 6,056 6,106 6,157 6,209 6,235 6,261 6,287 6,313 6,339
2 98.5 99.0 99.2 99.2 99.3 99.3 99.4 99.4 99.4 99.4 99.4 99.4 99.4 99.5 99.5 99.5 99.5 99.5
3 34.1 30.8 29.5 28.7 28.2 27.9 27.7 27.5 27.3 27.2 27.1 26.9 26.7 26.6 26.5 26.4 26.3 26.2
4 21.2 18.0 16.7 16.0 15.5 15.2 15.0 14.8 14.7 14.5 14.4 14.2 14.0 13.9 13.8 13.7 13.7 13.6
5 16.3 13.3 12.1 11.4 11.0 10.7 10.5 10.3 10.2 10.1 9.89 9.72 9.55 9.47 9.38 9.29 9.20 9.11
6 13.7 10.9 9.78 9.15 8.75 8.47 8.26 8.10 7.98 7.87 7.72 7.56 7.40 7.31 7.23 7.14 7.06 6.97
7 12.2 9.55 8.45 7.85 7.46 7.19 6.99 6.84 6.72 6.62 6.47 6.31 6.16 6.07 5.99 5.91 5.82 5.74
8 11.3 8.65 7.59 7.01 6.63 6.37 6.18 6.03 5.91 5.81 5.67 5.52 5.36 5.28 5.20 5.12 5.03 4.95
9 10.6 8.02 6.99 6.42 6.06 5.80 5.61 5.47 5.35 5.26 5.11 4.96 4.81 4.73 4.65 4.57 4.48 4.40
10 10.0 7.56 6.55 5.99 5.64 5.39 5.20 5.06 4.94 4.85 4.71 4.56 4.41 4.33 4.25 4.17 4.08 4.00
11 9.65 7.21 6.22 5.67 5.32 5.07 4.89 4.74 4.63 4.54 4.40 4.25 4.10 4.02 3.94 3.86 3.78 3.69
12 9.33 6.93 5.95 5.41 5.06 4.82 4.64 4.50 4.39 4.30 4.16 4.01 3.86 3.78 3.70 3.62 3.54 3.45
13 9.07 6.70 5.74 5.21 4.86 4.62 4.44 4.30 4.19 4.10 3.96 3.82 3.66 3.59 3.51 3.43 3.34 3.25
14 8.86 6.51 5.56 5.04 4.70 4.46 4.28 4.14 4.03 3.94 3.80 3.66 3.51 3.43 3.35 3.27 3.18 3.09
15 8.68 6.36 5.42 4.89 4.56 4.32 4.14 4.00 3.89 3.80 3.67 3.52 3.37 3.29 3.21 3.13 3.05 2.96
16 8.53 6.23 5.29 4.77 4.44 4.20 4.03 3.89 3.78 3.69 3.55 3.41 3.26 3.18 3.10 3.02 2.93 2.84
17 8.40 6.11 5.19 4.67 4.34 4.10 3.93 3.79 3.68 3.59 3.46 3.31 3.16 3.08 3.00 2.92 2.83 2.75
18 8.29 6.01 5.09 4.58 4.25 4.01 3.84 3.71 3.60 3.51 3.37 3.23 3.08 3.00 2.92 2.84 2.75 2.66
19 8.19 5.93 5.01 4.50 4.17 3.94 3.77 3.63 3.52 3.43 3.30 3.15 3.00 2.92 2.84 2.76 2.67 2.58
20 8.10 5.85 4.94 4.43 4.10 3.87 3.70 3.56 3.46 3.37 3.23 3.09 2.94 2.86 2.78 2.69 2.61 2.52
21 8.02 5.78 4.87 4.37 4.04 3.81 3.64 3.51 3.40 3.31 3.17 3.03 2.88 2.80 2.72 2.64 2.55 2.46
22 7.95 5.72 4.82 4.31 3.99 3.76 3.59 3.45 3.35 3.26 3.12 2.98 2.83 2.75 2.67 2.58 2.50 2.40
23 7.88 5.66 4.76 4.26 3.94 3.71 3.54 3.41 3.30 3.21 3.07 2.93 2.78 2.70 2.62 2.54 2.45 2.35
24 7.82 5.61 4.72 4.22 3.90 3.67 3.50 3.36 3.26 3.17 3.03 2.89 2.74 2.66 2.58 2.49 2.40 2.31
25 7.77 5.57 4.68 4.18 3.86 3.63 3.46 3.32 3.22 3.13 2.99 2.85 2.70 2.62 2.53 2.45 2.36 2.27
30 7.56 5.39 4.51 4.02 3.70 3.47 3.30 3.17 3.07 2.98 2.84 2.70 2.55 2.47 2.39 2.30 2.21 2.11
40 7.31 5.18 4.31 3.83 3.51 3.29 3.12 2.99 2.89 2.80 2.66 2.52 2.37 2.29 2.20 2.11 2.02 1.92
60 7.08 4.98 4.13 3.65 3.34 3.12 2.95 2.82 2.72 2.63 2.50 2.35 2.20 2.12 2.03 1.94 1.84 1.73
120 6.85 4.79 3.95 3.48 3.17 2.96 2.79 2.66 2.56 2.47 2.34 2.19 2.03 1.95 1.86 1.76 1.66 1.53
∞ 6.63 4.61 3.78 3.32 3.02 2.80 2.64 2.51 2.41 2.32 2.18 2.04 1.88 1.79 1.70 1.59 1.47 1.32
