BASIC STATISTICS

A STEP BY STEP APPROACH

Dinesh Krishna Rao, The University of the South Pacific


License

© 2018 The University of the South Pacific (USP). Except where otherwise noted, this work is licensed
under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. To view a
copy of this license, visit https://creativecommons.org/licenses/by-nc-sa/4.0/.

This work was carried out with the aid of a grant from the Office of the Deputy Vice-Chancellor Learning,
Teaching and Student Services (LTSS), USP as part of the Open Educational Resources (OER) Course
Conversion project.

Disclaimer
“The publication is released for educational purposes, and all information provided is on an ‘as is’ basis.
Although the author and publisher have made every effort to ensure that the information in this publication
was correct at the time of going to press, the author and publisher do not assume and hereby disclaim
any liability to any party for any loss, damage, or disruption caused by errors or omissions, whether such
errors or omissions result from negligence, accident, or any other cause. Any views expressed in the
publication are those of the author, and do not necessarily reflect the views of The University of the South
Pacific. All products and services mentioned are owned by their respective copyright holders, and mere
presentation in the publication does not mean endorsement by The University of the South Pacific.
Derivatives of this work are not authorized to use the logo of The University of the South Pacific.”

Basic Statistics – A Step By Step Approach ii


Acknowledgement
I would like to thank God, the Almighty, for giving me the knowledge, health and blessing to write this book.
Special thanks to Mr Ravinesh Chand, Teaching Assistant in Mathematics, for helping me compile the
exercises and their solutions.

Preface
This book, “Basic Statistics - A Step By Step Approach”, is designed for use in a first course in
statistics. It introduces students to basic statistical concepts using a step by step approach. The book
also includes many examples and exercises with solutions to help students understand the concepts
better. It has fourteen chapters and an appendix.

This book makes reference to the Eton Statistical and Maths Tables (4th Edition) published by Pearson
New Zealand.

Chapter 1: Introduction to Statistics


This chapter introduces statistics. It explains the basic terms and concepts such as statistics; branches
of statistics; types of variables; techniques to collect data; sampling techniques; observational and
experimental studies. The chapter concludes with a summary and a set of exercises.

Chapter 2: Frequency Distributions and Graphs


This chapter explains how to organize and present data. The concepts discussed in this chapter are as
follows: organizing data; graphical presentation of data; shape of distributions; stem and leaf plots. The
chapter concludes with a summary and a set of exercises.

Chapter 3: Data Description


This chapter discusses how data can be described using statistical methods. The concepts discussed in
this chapter are as follows: measures of central tendency; measures of variation; measures of position;
outliers; exploratory data analysis. The chapter concludes with a summary and a set of exercises.

Chapter 4: Probability (Part I)


This chapter introduces the concepts of probability. It explains the basic terms and concepts such as
probability; probabilistic experiments; sample space; event; complement, intersection and union of events;
classical, empirical and subjective probability; the addition rule and mutually exclusive events. The
chapter concludes with a summary and a set of exercises.

Chapter 5: Probability (Part II)


This chapter explains the more sophisticated concepts in probability such as independent events;
conditional probability; probability and counting rules. The chapter concludes with a summary and a set
of exercises.

Chapter 6: Discrete Probability Distributions


This chapter explains the concept of a discrete probability distribution. The concepts discussed in this
chapter are as follows: random variable; discrete probability distribution; mean and variance of a discrete
probability distribution; binomial distribution. The chapter concludes with a summary and a set of
exercises.

Chapter 7: The Normal Distribution
This chapter discusses the normal distribution. The concepts discussed in this chapter are as follows: the
normal distribution; standard normal distribution; applications of the normal distribution; the central limit
theorem. The chapter concludes with a summary and a set of exercises.

Chapter 8: Confidence Intervals and Sample Size


This chapter explains how to construct confidence intervals and determine the minimum sample size. The
concepts discussed in this chapter are as follows: confidence intervals for the population mean and population
proportion; minimum sample size needed to estimate the population mean and population proportion. The
chapter concludes with a summary and a set of exercises.

Chapter 9: Hypothesis Testing (Part I)


This chapter introduces the concept of hypothesis testing. The concepts discussed in this chapter are as
follows: statistical hypothesis; null and alternate hypotheses; statistical test; type I and type II errors; level
of significance; critical and non-critical regions; z-test for mean; methods of hypothesis testing. The
chapter concludes with a summary and a set of exercises.

Chapter 10: Hypothesis Testing (Part II)


This chapter discusses the t-test for mean and the z-test for population proportion. The chapter
concludes with a summary and a set of exercises.

Chapter 11: Testing the Equality of Two Population Means


This chapter explains the hypothesis testing of the equality of two population means. The concepts
discussed in this chapter are the z-test and the t-test for testing two population means. The chapter
concludes with a summary and a set of exercises.

Chapter 12: Correlation and Regression


This chapter explains the concepts of correlation and regression. The concepts discussed in this chapter
are scatter plots, correlation coefficient, testing the significance of correlation, regression line and
coefficient of determination. The chapter also discusses the concept of multiple linear regression. The
chapter concludes with a summary and a set of exercises.

Chapter 13: The Chi-Square Tests


This chapter focuses on the chi-square tests to analyse categorical data. The chi-square tests discussed
are: (1) test for goodness of fit; (2) test for independence of variables. The chapter concludes with a
summary and a set of exercises.

Chapter 14: Analysis of Variance


This chapter explains the concepts of analysis of variance (ANOVA). The concepts discussed in this
chapter are the F-distribution and one-way and two-way analysis of variance. The chapter concludes with a
summary and a set of exercises.

Appendix A: Answers to Exercises


Appendix A provides the solutions to the exercises included in the textbook.

Table of Contents
License
Disclaimer
Acknowledgement
Preface
CHAPTER 1: INTRODUCTION TO STATISTICS
Overview
Objectives
1.1 Introduction
1.2 What is Statistics?
1.3 Reasons to Study Statistics
1.4 Branches of Statistics
1.4.1 Descriptive Statistics
1.4.2 Inferential Statistics
1.5 Variables and Data Types
1.6 Data Collection Techniques
1.7 Sampling Techniques
1.8 Observational and Experimental Studies
1.8.1 Observational Study
1.8.2 Experimental Study
1.9 Summary
CHAPTER 2: FREQUENCY DISTRIBUTION AND GRAPHS
Overview
Objectives
2.1 Introduction
2.2 Organizing Data
2.2.1 Categorical Frequency Distribution
2.2.2 Ungrouped Frequency Distribution
2.2.3 Grouped Frequency Distribution
2.3 Graphical Presentation of Data
2.3.1 Bar Graphs
2.3.2 Pareto Charts
2.3.3 Time Series Graphs
2.3.4 Pie Charts
2.3.5 Histograms
2.3.6 Frequency Polygons
2.3.7 Ogives
2.3.8 Relative Frequency Graphs
2.4 Distribution Shapes
2.5 Stem and Leaf Plots
2.6 Summary
CHAPTER 3: DATA DESCRIPTION
Overview
Objectives
3.1 Introduction
3.2 Measures of Central Tendency
3.2.1 The Mean
3.2.2 The Median
3.2.3 The Mode
3.2.4 The Midrange
3.2.5 The Weighted Mean
3.2.6 Relationships among Mean, Median and Mode
3.3 Measures of Variation
3.3.1 Range
3.3.2 The Variance and Standard Deviation
3.3.3 Coefficient of Variation
3.4 Measures of Position
3.4.1 Standard Scores
3.4.2 Percentiles
3.4.3 Deciles
3.4.4 Quartiles
3.4.5 Other Measures of Variation
3.5 Outliers
3.6 Exploratory Data Analysis (EDA)
3.7 Summary
CHAPTER 4: PROBABILITY (PART I)
Overview
Objectives
4.1 Introduction
4.2 Basic Concepts in Probability
4.2.1 Event
4.2.2 Complement of an Event
4.2.3 Intersection of Two Events
4.2.4 Union of Two Events
4.3 Interpretations of Probability
4.3.1 Classical Probability
4.3.2 Empirical or Relative Frequency Probability
4.3.3 Subjective Probability
4.4 The Addition Rules for Probability
4.5 Summary
CHAPTER 5: PROBABILITY (PART II)
Overview
Objectives
5.1 Introduction
5.2 Independent Events
5.3 Conditional Probability and Dependent Events
5.3.1 Conditional Probability
5.3.2 Dependent Events
5.4 Counting Rules
5.4.1 Fundamental Counting Rule
5.4.2 Permutation Rule
5.4.3 Combination Rule
5.5 Probability and Counting Rules
5.6 Summary
CHAPTER 6: DISCRETE PROBABILITY DISTRIBUTIONS
Overview
Objectives
6.1 Introduction
6.2 Random Variables
6.3 Discrete Probability Distribution
6.4 Mean, Variance and Standard Deviation of Discrete Distribution
6.4.1 The Mean
6.4.2 The Variance and Standard Deviation
6.5 The Binomial Distribution
6.5.1 Requirement of Binomial Experiments
6.5.2 Binomial Probability Formula
6.5.3 Mean, Variance and Standard Deviation of the Binomial Distribution
6.6 Summary
CHAPTER 7: THE NORMAL DISTRIBUTION
Overview
Objectives
7.1 Introduction
7.2 The Normal Distribution
7.2.1 Properties of Normal Distribution
7.2.2 Standard Normal Distribution
7.3 Applications of Normal Distribution
7.4 The Central Limit Theorem
7.4.1 The sampling distribution of Sample Mean (X̄)
7.4.2 Properties of the sampling distribution of the Sample Mean
7.4.3 The Central Limit Theorem
7.5 Summary
CHAPTER 8: CONFIDENCE INTERVALS AND SAMPLE SIZE
Overview
Objectives
8.1 Introduction
8.2 Estimation
8.2.1 Properties of a Good Estimator
8.2.2 Types of Estimates
8.3 Confidence Intervals and Sample Size for the Mean when σ is known
8.3.1 Formula for the Confidence Interval
8.3.2 Formula for Minimum Sample Size
8.4 Characteristics of the t-distribution
8.4.1 Formula for the Confidence Interval
8.5 Confidence Intervals and Sample Size for Proportion
8.5.1 Sampling Distribution of Sample Proportion
8.5.2 Confidence Interval for Proportion
8.5.3 Formula for Minimum Sample Size
8.6 Summary
CHAPTER 9: HYPOTHESIS TESTING (PART I)
Overview
Objectives
9.1 Introduction
9.2 Concepts of Hypothesis Testing
9.2.1 Statistical Hypothesis
9.2.2 Type of Statistical Hypothesis
9.2.3 Statistical Test
9.2.4 Level of Significance
9.2.5 Critical Region, Acceptance Region and Critical Value
9.3 z-test for Mean
9.3.1 Test Statistic
9.4 Methods of Hypothesis Testing
9.4.1 The P-value Method
9.4.2 The Confidence Interval Method
9.5 Summary
CHAPTER 10: HYPOTHESIS TESTING
Overview
Objectives
10.1 Introduction
10.2 t-test for Mean
10.3 z-test for Proportion
10.4 Summary
CHAPTER 11: TESTING THE EQUALITY OF TWO POPULATION MEANS
Overview
Objectives
11.1 Introduction
11.2 z-test for two Means
11.2.1 Dependent and Independent Samples
11.2.2 Hypothesis
11.3 t-test for Two Means (Independent Samples)
11.4 Summary
CHAPTER 12: CORRELATION AND REGRESSION
Overview
Objectives
12.1 Introduction
12.2 Correlation
12.2.1 Scatter Plots
12.2.2 The Correlation Coefficient
12.2.3 Hypothesis Testing of Correlation Coefficient
12.3 Simple Linear Regression
12.4 Multiple Linear Regression
12.5 Summary
CHAPTER 13: THE CHI-SQUARE TESTS
Overview
Objectives
13.1 Introduction


13.2 The Chi-square Distribution
13.3 Test for Goodness of fit
13.4 Test for Independence
13.5 Summary
CHAPTER 14: ANALYSIS OF VARIANCE
Overview
Objectives
14.1 Introduction
14.2 The F−distribution
14.2.1 Characteristics of F-Distribution
14.3 One-Way Analysis of Variance
14.4 Two-Way Analysis of Variance
14.5 Summary
REFERENCES
APPENDIX A: ANSWERS TO EXERCISES
Chapter 1: Introduction to Statistics
Chapter 2: Frequency Distributions and Graphs
Chapter 3: Data Description
Chapter 4: Probability (Part I)
Chapter 5: Probability (Part II)
Chapter 6: Discrete Probability Distributions
Chapter 7: The Normal Distribution
Chapter 8: Confidence Intervals and Sample Size
Chapter 9: Hypothesis Testing (Part I)
Chapter 10: Hypothesis Testing (Part II)
Chapter 11: Testing the Equality of Two Population Means
Chapter 12: Correlation and Regression
Chapter 13: The Chi-Square Tests
Chapter 14: Analysis of Variance



CHAPTER 1:

INTRODUCTION TO STATISTICS

Chapter 1: Introduction to Statistics 1


Overview
This chapter provides an introduction to statistics. It explains the basic terms and concepts such as
statistics; branches of statistics; types of variables; techniques to collect data; sampling techniques;
observational and experimental studies. The chapter concludes with a summary and a set of exercises.

Objectives
After completing this chapter, you should be able to:
1. Define some statistical terms.
2. Differentiate between descriptive and inferential statistics.
3. Identify types of variables.
4. Identify the measurement levels for each variable.
5. Identify the sampling technique used.
6. Differentiate between an observational and an experimental study.

1.1 Introduction
This chapter provides an introduction to statistics. It explains the basic terms and concepts such as
statistics; branches of statistics; types of variables; techniques to collect data; sampling techniques;
observational and experimental studies.

1.2 What is Statistics?


You may be familiar with statistics through radio, television, newspapers and magazines. For example,
you may have heard or read statements like the following:
• The National Fire Authority of Fiji managed to reduce the number of structural fires by 50% over the
period of 2017.
• There is an increase from ST$50.00 to ST$63.00 in weekly fuel costs for an average family
in Samoa due to the worldwide oil crisis.
• The Fiji national population is expected to top the 1 million mark by December 2016.
• Teenagers need at least 8 hours of sleep per day.
• 7 out of 30 players selected in the Tongan team for the 2018 IRB series are under 19
years old.
• In Samoa, 20% of adults aged 25 and older have at least a bachelor’s degree.
Almost everyone is exposed to some probability and statistics in daily life. In sports, you may keep
records of the number of goals scored by a player in a soccer league season. In public health, an
administrator might want to know whether there is any relationship between a person’s age, weight and
blood pressure. In education, a teacher might want to know whether new methods of teaching are better
than old ones. These are only a few examples of how people can make use of statistics in their
occupations.

The word statistics, however, is used to mean two different things. One of the definitions is that statistics
are numbers measured for some purpose. A more complete definition is the following:

Statistics can be defined as the science of conducting studies to collect, organize, summarize,
analyze and draw meaningful conclusions from data.



1.3 Reasons to Study Statistics
• To be able to read and understand the various statistical studies performed in your field. To have
this understanding, you must be knowledgeable about the vocabulary, symbols, concepts, and
statistical procedures used in these studies.
• To be able to conduct research in your area of interest, since statistical procedures are basic to
research. To accomplish this, you must be able to design experiments; collect, organize, analyse,
and summarize data; and possibly make reliable predictions or forecasts for future use.
• To be able to use the knowledge gained from studying statistics to become better consumers
and citizens. For example, you can make intelligent decisions about what products to purchase
based on consumer studies.

EXAMPLE 1−1

Briefly describe the two meanings of the word statistics.

SOLUTION The word "statistics" has the following two meanings:


First, it refers to numerical facts such as the ages of persons, incomes of families etc.
Second, it refers to the field of study. It provides us with techniques that help us to
collect, analyse, present, and interpret data and to make decisions.

1.4 Branches of Statistics


Applied statistics can be divided into two branches:
1. Descriptive statistics; and
2. Inferential statistics.

To describe the two branches of statistics, it is useful to know the definition of the following statistical
terms:

A variable is a characteristic or attribute that can assume different values. Variables whose
values are determined by chance are called random variables.

Data are the values (measurement or observations) that the variables can assume. A collection
of data values forms a data set. Each value in the data set is called a data value or a datum.

A population consists of all subjects that are being studied. A sample is a subset of the
population.

A statistic is a characteristic or a measure obtained by using the data values from a sample.
A parameter is a characteristic or a measure obtained by using all the data values from a
population.

A survey that includes every member of the population is called a census. The technique of
collecting information from a portion of the population is called a sample survey.



Suppose we measured and recorded the heights of all the students enrolled in ST130. In statistical
terminology, the variable is height, the measured height of a student (e.g. 187 cm) is a data value, the set
of heights of the students is a data set, and all the students of ST130 form the population.
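The distinction between a parameter and a statistic can be sketched with a short Python example. The heights below are hypothetical values for a made-up class of 12 students (they are not real ST130 data), and the sample size of 4 is an arbitrary choice for illustration.

```python
import random
from statistics import mean

# Hypothetical heights (cm) for a made-up class of 12 students: the population.
population_heights = [187, 165, 172, 158, 180, 169, 175, 162, 171, 177, 166, 183]

# Parameter: a measure computed from every data value in the population.
population_mean = mean(population_heights)

# Statistic: a measure computed from a sample (a subset of the population).
rng = random.Random(1)  # fixed seed so the sketch is repeatable
sample_heights = rng.sample(population_heights, 4)
sample_mean = mean(sample_heights)

print(f"Population mean (parameter): {population_mean:.1f} cm")
print(f"Sample of 4 heights: {sample_heights}")
print(f"Sample mean (statistic): {sample_mean:.1f} cm")
```

The sample mean will usually differ a little from the population mean, and it changes whenever a different sample is drawn.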

EXAMPLE 1−2

Explain whether each of the following constitutes a population or a sample.


A. Number of fish caught by all participants in a fishing trip.
B. Yield of sweet potatoes per acre for 10 pieces of land.
C. Ages of all players in a rugby team.
D. Number of traffic offences by 200 drivers in Vanuatu.

SOLUTION A. Population because all participants are being studied.


B. Sample because only 10 pieces of land are being studied.
C. Population because all players are being studied.
D. Sample because only 200 drivers are being studied.

EXAMPLE 1−3

An ANZ Bank Manager with 12,000 customers conducts a survey to gauge customer views on Internet
Banking, which would incur lower bank fees. In the survey, 21% of the 300 customers interviewed said that
they were interested in Internet Banking.
A. What is the population of interest?
B. What is the sample?
C. Is the value 21% a parameter or a statistic?

SOLUTION A. Views on internet banking of the 12,000 customers


B. Views on internet banking of the 300 customers
C. Statistic

1.4.1 Descriptive Statistics


Descriptive statistics is used when the statistician tries to describe a situation using data from the
entire population.

A data set in its original form is called raw data and is usually very large. Consequently, such a data set
is not very helpful in making conclusions or decisions. It is easier to draw conclusions from summary
tables and diagrams than from the raw data. So, we reduce the raw data by constructing frequency tables,
drawing graphs, or calculating summary measures such as the mean and standard deviation. The portion
of statistics that deals with this type of statistical analysis is called descriptive statistics. For example,
consider the national census conducted by the Fiji Government. Results of this census give the average
age, income and other characteristics of the Fiji population. It is an example of descriptive statistics
because the entire population is being used.



Descriptive statistics consists of the collection, organization, summarization and presentation of
data.

1.4.2 Inferential Statistics


Inferential statistics is used when the population is too large to be studied in full, so a sample is used
to draw conclusions about the population. It is also known as inductive reasoning or inductive statistics.
For example, say we want to determine the average height of people in Australia. Since the population
of Australia is too large, it would be difficult to measure and record the heights of all the people in
Australia. Rather than using the population, we can use a sample to estimate the average height of people
in Australia.

Inferential statistics consists of generalizing from samples to populations, performing estimations
and hypothesis tests, determining relationships among variables, and making predictions.

EXAMPLE 1−4

In each of the following statements, tell whether descriptive or inferential statistics have been used.
A. In the year 2020, 20000 students will be enrolled at USP.
B. Income for the cane farmers in Fiji was 1.2 million in 2017.
C. Research stated that the shape of a person’s ears is related to the person’s aggression.
D. The national average annual medicine expenditure per person is $1052.

SOLUTION A. Inferential
B. Descriptive
C. Inferential
D. Descriptive

1.5 Variables and Data Types


Variables represent data, and since the data collected can vary in nature, variables are classified into two
specific groups to accommodate all the data values: qualitative and quantitative.

Qualitative variables are those that cannot be assigned numerical values but are placed into distinct
categories determined by some attribute or characteristic. For example, gender can be categorized as
either male or female. Colour, religion and geographical location are other examples. Quantitative
variables are variables that take numerical values and hence can be ranked or ordered. For example,
temperature can have any numerical value and can be ordered either from highest to lowest or vice versa.
Other examples include age, height, weight and volume.

Since quantitative variables can assume any numerical value, it is important to further categorize them
into discrete and continuous variables.



Discrete variables assume values that can be counted, e.g. the number of rooms in a building. Continuous
variables can assume an infinite number of values between any two specific boundary values. These values
are often measured (as opposed to counted) and can include fractions and decimals, e.g. temperature.

We have seen the classification of variables into qualitative and quantitative variables. We now look at
how variables can be classified by how they are categorized, counted or measured, and for this we
use the levels of measurement. There are four levels of measurement: nominal, ordinal, interval, and
ratio. These go from the lowest level to the highest level. Data are classified according to the highest
level that they fit. Each additional level adds something the previous level did not have.

Nominal level of measurement is used to describe qualitative variables, which cannot be assigned
numerical values and hence cannot be ordered. Examples whereby nominal measurement may be
applied include subject areas of study (Mathematics, Algebra, Statistics, Language, etc.) or colours (blue,
red, green, etc.)

The nominal level of measurement classifies data into mutually exclusive (non-overlapping),
exhaustive categories in which no order or ranking can be imposed on the data.

Ordinal level of measurement describes qualitative data as well, but unlike nominal level of measurement,
it allows categorization that can be sorted or ranked. It is important to note, however, that precise
differences between the ranks do not exist. Examples include the grade letters (A, B, C, D, E, and F) and
positions achieved in a marathon (first, second, third, etc.)

The ordinal level of measurement classifies data into categories that can be ranked or ordered, but
precise differences do not exist between these categories.

The interval level of measurement differs from ordinal in the sense that precise differences do exist
between data values. For example, temperature can be ranked, and there is a precise difference
between any two temperature values (2 degrees between 19°C and 21°C). However, no meaningful 0 exists:
in temperature measurement, 0°C does not mean no heat at all. Likewise, an IQ score of
0 does not mean the subject’s intelligence is zero.

The interval level of measurement ranks data, and precise differences between units of measure do
exist. However, there is no meaningful zero.

The ratio level of measurement has all the properties of the interval level but also has a meaningful zero.
Examples include height, weight, salary, etc. A true ratio also exists between two measurements of the
same variable; for example, a salary of $40,000 is twice a salary of $20,000.

The ratio level of measurement possesses all characteristics of the interval level of measurement
but there exists a meaningful zero.



EXAMPLE 1−5

Indicate which of the following variables are quantitative and which are qualitative. Classify the
quantitative variables as discrete or continuous and classify the qualitative variables as nominal or
ordinal.
A. Number of road accidents in a year.
B. The time a student takes to walk to school.
C. Religion of people in Fiji.
D. Length of jump by athletes in long jump event.
E. Number of errors on each page of a book.
F. Grades of students at USP (A+, A, B+, B, etc.).
G. Shoe size of a person.
H. Education level of a sugarcane farmer.

SOLUTION A. Quantitative because the variable is numerical and discrete because the variable
is countable.
B. Quantitative because the variable is numerical and continuous because the
variable is measured.
C. Qualitative because the variable is categorical and nominal because the variable
has no order or ranking.
D. Quantitative because the variable is numerical and continuous because the
variable is measured.
E. Quantitative because the variable is numerical and discrete because the variable
is countable.
F. Qualitative because the variable is categorical and ordinal because the variable
has order or ranking.
G. Quantitative because the variable is numerical and discrete because shoe sizes
take only specific values (e.g. 7, 7.5, 8).
H. Qualitative because the variable is categorical and ordinal because the variable
has order or ranking.

EXAMPLE 1−6

Classify each of the following attributes as either categorical or numerical. For those that are numerical,
determine whether they are ratio or interval and for those that are categorical, determine whether they
are nominal or ordinal.

A. Marital status of patients at a medical clinic.


B. Thickness of the gelatine coating of a vitamin E capsule.
C. Temperature inside ten refrigerators at a supermarket.
D. Ratings of eight local soccer players (poor, fair, good, excellent).



SOLUTION
A. categorical and nominal (because the variable is non-numerical and has no
order or ranking)
B. numerical and ratio (because the variable is numerical and has a meaningful
zero)
C. numerical and interval (because the variable is numerical and doesn’t have a
meaningful zero)
D. categorical and ordinal (because the variable is non-numerical and has order
or ranking)

1.6 Data Collection Techniques


Statistical studies need data, and data obviously has to be collected or gathered. Various data collection
techniques exist, which suit different surveying needs of statisticians. Some are stated below:

1. Personal Interviews: This is an in-person interview where the researcher asks a standard set of
questions.
Advantage: Obtains in-depth responses to questions.
Disadvantages: Costs more than the other two methods; the interviewer may be biased in the selection
of subjects or could even unknowingly influence the responses of the interviewee.

2. Telephone Interviews: This is an interview by phone where the researcher asks a standard set of
questions.
Advantages: Costs less than a personal interview; subjects tend to be more candid in their opinions.
Disadvantage: Not all subjects may have telephone access.

3. Mailed Questionnaire Surveys: This is where a researcher prepares a questionnaire and then mails it
out to respondents for their opinions.
Advantages: Covers a wider area than a telephone survey or personal interview; subjects can remain
anonymous; and it is less costly.
Disadvantages: The response rate may be low; answers may be inappropriate; and subjects may have
difficulty reading the questions or might misinterpret the questions altogether.

Data can also be collected in other ways, such as surveying records or direct observation of situations.



1.7 Sampling Techniques
In most cases, time and resource constraints such as cost and manpower do not allow study of the entire
population, so samples must be drawn.

Since some populations being studied may be too large for descriptive statistics to be applied (i.e. to
collect data about each and every individual subject), inferential statistics is applied instead. Samples
must therefore be selected from the population carefully so that they represent it fairly. The
sampling techniques mainly used are random, systematic, stratified and cluster sampling.

Random sampling selects subjects by using chance methods or random numbers. E.g. each subject in the
population is numbered, the numbered cards are placed in a bowl/box/hat, and the required number of
cards is then drawn at random. In practice, statisticians use random number tables instead.

Systematic sampling requires each subject of the population to be numbered; every kth subject is then
selected. The first member of the sample, however, is selected at random. E.g. a sample of 50 is
needed from a population of size 2000; since 2000 ÷ 50 = 40, every 40th subject would be selected after
the first subject is randomly selected.
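The systematic sampling example above (a sample of 50 from a population of 2000) can be sketched in Python. This is a minimal illustration, not a prescribed procedure: the function name `systematic_sample` is made up for this sketch, and it assumes the random starting subject is chosen among the first k subjects, which is one common convention.

```python
import random

def systematic_sample(population, sample_size, rng):
    """Pick a random start among the first k subjects, then take every k-th subject."""
    k = len(population) // sample_size  # sampling interval, here 2000 // 50 = 40
    start = rng.randrange(k)            # random position among the first k subjects
    return population[start::k][:sample_size]

subjects = list(range(1, 2001))         # subjects numbered 1 to 2000
rng = random.Random(42)                 # fixed seed so the sketch is repeatable
sample = systematic_sample(subjects, 50, rng)

print("Sample size:", len(sample))
print("First three subjects selected:", sample[:3])
```

Consecutive members of the sample are always 40 subjects apart, so once the starting subject is known the whole sample is determined.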

Stratified sampling divides the population into groups called strata according to some attribute or
characteristic important to the study, and then samples are drawn from each group. Samples drawn from
the strata are randomly selected. E.g. in a study to determine obesity in the population, subjects
may be divided into groups by gender, age group or ethnicity.
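A minimal Python sketch of stratified sampling follows. The 100 subjects and their gender labels are hypothetical, the helper name `stratified_sample` is made up for illustration, and drawing the same number of subjects (5) from every stratum is an assumption; the text does not say how many to draw per stratum.

```python
import random
from collections import defaultdict

def stratified_sample(subjects, stratum_of, per_stratum, rng):
    """Group subjects into strata, then draw a random sample from each stratum."""
    strata = defaultdict(list)
    for subject in subjects:
        strata[stratum_of(subject)].append(subject)
    return {name: rng.sample(members, per_stratum)
            for name, members in strata.items()}

# Hypothetical population: (subject number, gender) pairs.
subjects = [(i, "female" if i % 2 == 0 else "male") for i in range(1, 101)]

rng = random.Random(7)  # fixed seed so the sketch is repeatable
sample = stratified_sample(subjects, lambda s: s[1], 5, rng)

for stratum, members in sample.items():
    print(stratum, [number for number, _ in members])
```

Every stratum is guaranteed to appear in the sample, which simple random sampling does not guarantee.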

Cluster sampling divides the population into groups called clusters by some means such as geographical
location, schools or city/suburb. Then, some of the clusters are randomly selected and all subjects
from these clusters are used in the study. This sampling technique is normally used when the population
size is very large or when the population is distributed across a large geographical area. This method is
also cost-effective. E.g. to study the eating habits of Fijians, certain villages or settlements may be
randomly selected and all individuals from those villages used in the study.
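Cluster sampling can be sketched in Python in the same spirit. The village names and residents below are hypothetical, and the helper name `cluster_sample` is made up for illustration. Unlike stratified sampling, whole clusters are chosen at random and every subject in a chosen cluster is used.

```python
import random

def cluster_sample(clusters, n_clusters, rng):
    """Randomly choose whole clusters and keep every subject in each chosen cluster."""
    chosen = rng.sample(sorted(clusters), n_clusters)
    return {name: list(clusters[name]) for name in chosen}

# Hypothetical clusters: villages and the individuals living in them.
villages = {
    "Village A": ["A1", "A2", "A3"],
    "Village B": ["B1", "B2"],
    "Village C": ["C1", "C2", "C3", "C4"],
    "Village D": ["D1", "D2", "D3"],
}

rng = random.Random(3)  # fixed seed so the sketch is repeatable
sample = cluster_sample(villages, 2, rng)

for village, residents in sample.items():
    print(village, residents)
```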

EXAMPLE 1−7

Classify each sample as random, systematic, stratified or cluster.


A. In a large school district, all teachers from two buildings are interviewed to determine whether they
believe the students have less homework to do now than in previous years.
B. Every 100th burger manufactured at McDonald’s is checked to determine its fat content.
C. Nursing supervisors are selected using random numbers to determine annual salaries.
D. The incomes of people in Fiji are divided into intervals. Then 10 people are selected from each
interval.

SOLUTION
A. Cluster
B. Systematic
C. Random
D. Stratified



1.8 Observational and Experimental Studies
Statistical studies may be classified in several different ways. We cover two types of study:
observational and experimental. For example, in a study of the migration of birds in massive flocks during
different seasons, an observational study is used to determine where these birds go and why. The
researcher cannot manipulate any variables in this type of study.

1.8.1 Observational Study


In an observational study, the researcher merely observes the current happenings and those of the past
and tries to find a relationship between them to draw conclusions.

Advantages:
Usually occur in natural settings; they can be carried out in situations where it would be unethical or
downright dangerous for a researcher to conduct an experiment; can be carried out using variables that
cannot be manipulated by the researcher.

Disadvantages:
The researcher does not control the variables; data on other variables that significantly influence the
outcome variable may not be collected; such studies can be expensive and time-consuming; and there is no
guarantee of the accuracy of the collected data.

1.8.2 Experimental Study


In an experimental study, a given variable is manipulated and its effects or influences on other variables
are determined.

For example, consider a study to determine the effect of beauty products on the skin. Here, the beauty
product is the independent variable.

Advantages:
The researcher can decide how to select subjects; can decide how to assign them to specific groups; and
can control or manipulate the independent variable.

Disadvantages:
Results may occur in unnatural settings; the behaviour of the participants may change because they know
they are taking part in a study (this is known as the Hawthorne effect); and other variables that the
researcher did not choose (confounding variables) may influence the outcome variable.

An independent variable in an experimental study is the one that is being manipulated by the
researcher. It is also called the explanatory variable. The resultant variable is called the dependent
variable or the outcome variable.

EXAMPLE 1−8

Identify each study as being either observational or experimental.


A. Subjects were randomly assigned to two groups, and one group was given a herb and the other
group a placebo. After 6 months, the number of respiratory tract infections each group had was
compared.



B. A researcher stood at a busy intersection to see whether the colour of the automobile a person drives
is related to running red lights.
C. A researcher finds that people who are more hostile have higher total cholesterol levels than
those who are less hostile.

SOLUTION A. experimental
B. observational
C. observational

1.9 Summary
This chapter introduced statistics. We have studied basic terms and concepts such as what statistics is;
why we study statistics; variable; population and sample; statistic and parameter; census and sample
survey; descriptive and inferential statistics; the types of variable, i.e. quantitative/qualitative,
discrete/continuous and nominal/ordinal/interval/ratio; techniques to collect data; the sampling
techniques, i.e. random/systematic/stratified/cluster; and observational and experimental studies. This
chapter will help readers understand the rest of the chapters better.

EXERCISES

1. In each statement, decide whether descriptive or inferential statistics is used:

A. The average life expectancy in Fiji is 79 years.


B. A diet high in fruits and vegetables will lower blood pressure.
C. The total amount of estimated losses from Cyclone Winston is more than 1 million dollars.
D. In 2020, the sea level will be higher than now.

2. A study of ST130 students in 2016 was undertaken to compare the average number of tutorial
sessions a student missed in 2016 with the previous year’s average of 3 sessions. A random sample
of 35 students was surveyed and it was found that the mean number of missed sessions for the 35
students was 2. Answer the following questions:

A. What is the variable used in this study?


B. Give an example of ‘statistic’ and ‘parameter’ from this study.

3. Classify each as nominal, ordinal, interval or ratio level of measurement.

A. Temperatures inside classrooms.


B. Level of performance (poor, fair, good, excellent).
C. Categories of magazines in a physician’s office (sports, women’s, health, men’s, news).
D. Time required by a student to complete the ST130 test.
E. The shoe size of staff members at the University of the South Pacific.



4. Which sampling method is used in each case?

A. Interviewing every 5th customer leaving a theatre about the movie they had seen.
B. The country is divided into economic classes and a sample is chosen from each class to be
surveyed.
C. A researcher divided subjects into 4 geographical groups and then selected all members from
a randomly selected group as samples.
D. A Maths tutor at USP is interested in the mean number of days an ST130 student is absent
from tutorial classes. The tutor takes her sample by gathering data on 5 randomly selected
students from the ST130 course.
E. Questioning every 14th customer leaving a theatre about the movie they had seen.

5. When running an experimental study, the group that is manipulated can be called the treatment
group. True or False.

6. Explain the relationship between confounding, dependent, and independent variables.

7. For each of the following, state whether the variable is continuous or discrete:

A. Number of students in ST130 class.


B. Palm length in centimetres.
C. Number of crime incidents in Suva last month.
D. Time taken to complete an assignment.

8. Classify each variable as qualitative or quantitative.

A. Number of apples sold in Suva market every hour.


B. Ranking of tennis players.
C. Colours of caps sold out from a shop.
D. Time it takes to cut the lawn.
E. Classification of children in a day care centre as infant, toddler, preschool.

9. Identify each study as being either observational or experimental.

A. A researcher on the busy street of Suva City asking random people that pass by how many pets
they have, then taking this data and using it to decide if there should be more pet food stores in
that area.
B. A researcher trying to determine the effects that eating strictly organic foods has on overall
health. The researcher finds 200 individuals, where 100 of them have eaten organically for the
past three years, and the other 100 have not eaten organically in the past three years.
C. A researcher trying to study the relation between internet access and the exam scores of
students. To do this, the students were randomly assigned to two groups, and only one group
was given internet access. After 4 months, the exam scores of the two groups were compared.



CHAPTER 2:

FREQUENCY DISTRIBUTION
AND GRAPHS

Chapter 2: Frequency Distributions and Graphs 13


Overview
This chapter explains how to organize and present data. The concepts discussed in this chapter are as
follows: organizing data; graphical presentation of data; shape of distributions; stem and leaf plots. The
chapter concludes with a summary and a set of exercises.

Objectives
After completing this chapter, you should be able to:
1. Organize data using frequency distribution tables.
2. Represent qualitative data graphically using bar graphs, Pareto charts, time series graphs and
pie charts.
3. Represent quantitative data graphically using histograms, frequency polygons and ogives.
4. Identify shape of frequency distributions.
5. Draw and interpret a stem and leaf plot.

2.1 Introduction
When conducting statistical studies, researchers collect data for a particular variable under study. For
example, if a researcher wishes to study the number of people who were infected with tuberculosis in
Suva over the past two years, he/she has to collect data from various doctors, hospitals and health
departments. In Chapter 1, we learned some techniques that researchers can use to collect data.
Data that are still in their original form and have not been processed for use are called raw data
(sometimes called source data or atomic data).
Since little information can be obtained from looking at the raw data, the researcher organizes the raw
data in some meaningful way. The most convenient method of organizing data is to construct a
frequency distribution.

After organizing the data, the researcher must present them in such a way that they can be understood by
those who will benefit from the study. The most useful method of presenting the data is by constructing
statistical charts and graphs. There are many different types of charts and graphs, and each one has a
specific purpose. In Chapter 2, you will learn the statistical methods of organizing and presenting data.

2.2 Organizing Data


The most convenient method of organizing data is to construct a frequency distribution. This section
explains how to organize qualitative and quantitative data using frequency distribution.

There are three types of frequency distributions: categorical (for qualitative data), ungrouped (for
quantitative data) and grouped (for quantitative data).

A frequency distribution is the organisation of raw data in table form using classes and frequencies.

2.2.1 Categorical Frequency Distribution


To organize qualitative (or categorical) data we have to construct a categorical frequency distribution.

A categorical frequency distribution lists all categories and the number of elements that belong to
each of the categories.



The following example illustrates how a categorical frequency distribution table is constructed.

EXAMPLE 2−1

A sample of 30 children from a primary school was selected, and each child was asked to name their
favourite fruit. They were given 3 options: apples, oranges and bananas. Their responses are given below:

orange apple banana banana apple orange


apple orange banana apple orange apple
banana apple apple banana orange banana
apple banana banana orange apple orange
orange apple orange banana banana banana

Construct a frequency distribution table for these data.

SOLUTION

Step 1: Choose the categories/classes for the distribution. Since there are 3 options: apples, oranges,
and bananas, these will be the categories/classes.
Step 2: Make a table as shown:

Favourite Fruit Tally Frequency ( f )


Apple
Banana
Orange

Step 3: Tally the data and put the results in the tally column.
Step 4: Count the tallies and place the results in the frequency column.
Step 5: Find the total for the frequency column. The completed table is shown.

Favourite Fruit Tally Frequency ( f )

Apple //// //// 10

Banana //// //// / 11

Orange //// //// 9

$\sum f = 30$

It is easier to gather information from the categorical frequency distribution than from the raw data.
For example, it can be concluded that
• Banana is the most preferred fruit (11 of the 30 children).
• 20 children prefer banana or orange.



It is sometimes important to compute the relative frequency and its percentage in a frequency
distribution.
Relative frequency is calculated by dividing the frequency of the class by the total frequency.

$$\text{Relative frequency of a class} = \frac{\text{Frequency of the class}}{\text{Total frequency}} = \frac{f}{\sum f}$$

The percentage for a class is obtained by multiplying the relative frequency of that class by 100.

$$\text{Percentage} = \text{Relative frequency} \times 100\%$$

The relative frequency and percentage distributions of Example 2-1 are given below:

Favourite Fruit   Frequency ( f )   Relative frequency   Percentage
Apple             10                10/30 = 0.33         33
Banana            11                11/30 = 0.37         37
Orange            9                 9/30 = 0.30          30
Total             $\sum f = 30$     1                    100

From the relative frequency and the percentage distribution, the following information can be obtained:
• The relative frequency of apple is 0.33, which means that 33% of the children prefer apple.
• 70% of the children prefer banana or apple.
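The tallying and relative frequency steps above can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `categorical_distribution` and the variable names are assumptions introduced here, and the data are the 30 responses of Example 2−1.

```python
from collections import Counter

# The 30 responses from Example 2-1, row by row.
responses = (
    "orange apple banana banana apple orange "
    "apple orange banana apple orange apple "
    "banana apple apple banana orange banana "
    "apple banana banana orange apple orange "
    "orange apple orange banana banana banana"
).split()


def categorical_distribution(values):
    """Return {category: (frequency, relative frequency, percentage)}."""
    counts = Counter(values)          # tally each category
    total = sum(counts.values())      # total frequency
    return {
        category: (f, f / total, 100 * f / total)
        for category, f in sorted(counts.items())
    }


table = categorical_distribution(responses)
for category, (f, rel, pct) in table.items():
    print(f"{category:<8}{f:>4}{rel:>8.2f}{pct:>6.0f}")
```

Each row printed corresponds to one row of the frequency table: the category, its frequency, its relative frequency and its percentage.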

2.2.2 Ungrouped Frequency Distribution


To organize quantitative (or numerical) data where the raw data is not grouped, we construct an
ungrouped frequency distribution.

An ungrouped frequency distribution lists each distinct data value and the number of elements that
take that value.

The following example illustrates how an ungrouped frequency distribution table is constructed.



EXAMPLE 2−2

A group of 24 customers of a popular restaurant were asked to review the quality of service.
They had to rate the service provided by the restaurant on a scale of 1−10. Their ratings are given below:

10 6 7 8 4 1 7 6
9 10 8 2 3 3 6 5
1 4 5 7 6 10 9 6
Construct a frequency distribution table for these data.

SOLUTION
Choose the classes for the distribution. Since the ratings are on a scale of 1−10, the values 1 to 10 will
be the classes. The procedure for constructing an ungrouped frequency distribution is the same as for a
categorical frequency distribution. The complete ungrouped frequency distribution table is shown below
with the relative frequencies and percentages.

Rating ( x )   Tally   Frequency ( f )   Relative frequency   Percentage
1              //      2                 2/24 = 0.083         8.3
2              /       1                 1/24 = 0.042         4.2
3              //      2                 2/24 = 0.083         8.3
4              //      2                 2/24 = 0.083         8.3
5              //      2                 2/24 = 0.083         8.3
6              ////    5                 5/24 = 0.208         20.8
7              ///     3                 3/24 = 0.125         12.5
8              //      2                 2/24 = 0.083         8.3
9              //      2                 2/24 = 0.083         8.3
10             ///     3                 3/24 = 0.125         12.5
Total                  $\sum f = 24$     1                    100

From the frequency distribution, the following information can be obtained:

• The highest rating is 10 and the lowest is 1.
• 5 customers rated above 8.
• 7 customers rated below 5.
• The most popular rating was 6.
• 24 customers took the survey.
• The relative frequency of rating 4 is 0.083, which means that 8.3% of customers gave the rating 4.
• 20.8% of customers rated above 8.

2.2.3 Grouped Frequency Distribution


If the number of distinct data values is large, the data must be grouped to make them more
comprehensible. We divide all the data into a small number of intervals, usually of equal width. These
intervals are called classes (class limits or class intervals).

A grouped frequency distribution organizes numerical data where the raw data is grouped using
class intervals of equal width.

To give an example of a grouped frequency distribution, let us consider the weights (in kg) of 50 pieces
of luggage with class intervals as follows:
Weight (kg)   Class boundaries   No. of pieces
7 − 9         6.5 − 9.5          2
10 − 12       9.5 − 12.5         8
13 − 15       12.5 − 15.5        14
16 − 18       15.5 − 18.5        19
19 − 21       18.5 − 21.5        7
Total                            50

From this, we note the following:


1. The intervals of weights in the first column, 7–9, 10–12, …, 19–21, are known as class intervals.
2. The first numbers in the class intervals, 7, 10, …, 19, are called the lower class limits of the
respective classes.
3. The second numbers in the class intervals, 9, 12, …, 21, are called the upper class limits of the
respective classes.
4. The intervals 6.5–9.5, 9.5–12.5, 12.5–15.5, 15.5–18.5 and 18.5–21.5 are known as class
boundaries. The first number in a class boundary is called the lower class boundary and the second
number is called the upper class boundary. These class boundaries are obtained by

$$\text{Lower class boundary} = \text{lower class limit} - \frac{d}{2}$$

$$\text{Upper class boundary} = \text{upper class limit} + \frac{d}{2}$$

where d is the difference between the upper class limit of one class and the lower class limit of the
next class. Here, $d = 1 \Rightarrow d/2 = 0.5$.



5. The numbers in the third column, 2, 8, 14, 19 and 7, are called frequencies; each gives the number of
data values in a particular class interval.
6. The class width or class size is the difference between the upper and lower class boundaries of a
class interval. For example, the class width for the class interval 13–15 is 15.5 − 12.5 = 3.
7. The class mark (or midpoint), $x_m$, of a class interval is obtained by

$$x_m = \frac{\text{lower class boundary} + \text{upper class boundary}}{2} \quad \text{or} \quad x_m = \frac{\text{lower class limit} + \text{upper class limit}}{2}.$$

The class boundaries are used to separate the classes so that there are no gaps in the frequency
distribution.
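The definitions above can be checked with a short Python sketch applied to the luggage classes. The function name `class_summary` and its parameter names are assumptions made for this illustration; `d` is the gap between consecutive classes as defined in item 4.

```python
def class_summary(lower_limit, upper_limit, d=1):
    """Return (lower boundary, upper boundary, class width, midpoint) of one class.

    d is the difference between the upper limit of one class and the
    lower limit of the next class (1 for whole-number data).
    """
    lower_boundary = lower_limit - d / 2
    upper_boundary = upper_limit + d / 2
    width = upper_boundary - lower_boundary
    midpoint = (lower_limit + upper_limit) / 2
    return lower_boundary, upper_boundary, width, midpoint


# Class limits of the luggage weights table.
luggage_classes = [(7, 9), (10, 12), (13, 15), (16, 18), (19, 21)]
summaries = [class_summary(lo, hi) for lo, hi in luggage_classes]

for (lo, hi), summary in zip(luggage_classes, summaries):
    print(f"{lo}-{hi}:", summary)
```

For the class 13–15 this gives boundaries 12.5 and 15.5, a class width of 3 and a midpoint of 14, in agreement with items 4, 6 and 7.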

To construct a grouped frequency distribution, follow these rules:

1. There should be between 5 and 20 classes.
2. It is preferable, but not compulsory, that the class width be an odd number. This ensures that the
midpoint of each class has the same number of decimal places as the data.
3. The classes must be mutually exclusive i.e. classes must not overlap.
4. The classes must be continuous. Even if there are no values in a class, the class must be included
in the frequency distribution.
5. The classes must be exhaustive i.e. there should be enough classes to accommodate all the data.
6. All classes must be equal in width.

EXAMPLE 2−3

Peter picked 40 leaves from a mango tree and measured their lengths in centimetres. He collected the
following data:

19, 16, 13, 17, 7, 8, 4, 18, 10, 17, 18, 9, 12, 5, 9, 9, 16, 1, 8, 17
1, 10, 5, 9, 11, 15, 6, 14, 9, 17, 1, 12, 5, 16, 4, 16, 8, 15, 14, 17

Construct a frequency distribution for the data using 5 classes.

SOLUTION

Step 1: Determine the class intervals.

• Find the highest and lowest values: H = 19 and L = 1.
• Find the range = highest value – lowest value = 19 – 1 = 18.
• Select the number of classes desired (usually between 5 and 20). In this case, 5 is arbitrarily
chosen.
• Calculate the class width (class size) by dividing the range by the number of classes:
class width = 18 / 5 = 3.6. Then round the class width up to the nearest whole number; hence
the class width is 4.

• Now select a starting point for the 1st class. This can be the smallest data value or any
convenient number less than the smallest data value; 0 is used in this case. Since the class
width is 4, the lower class limits are 0, 4, 8, 12 and 16. The upper limit of the 1st class is calculated
by subtracting 1 from the lower class limit of the 2nd class, and similarly for the other classes, so
the upper limits are 3, 7, 11, 15 and 19. The class limits are therefore 0−3, 4−7, 8−11, 12−15 and
16−19. To find the class boundaries, note that $d = 1 \Rightarrow d/2 = 0.5$, so the class boundaries
are −0.5−3.5, 3.5−7.5, 7.5−11.5, 11.5−15.5 and 15.5−19.5.

Step 2: Tally the data.


Step 3: Find the numerical frequencies from the tallies.
Step 4: Find the total for the frequency column. The completed table is shown.

Length (cm) Class boundaries Tally Frequency


0−3 −0.5 – 3.5 /// 3

4 −7 3.5 – 7.5 //// // 7

8 − 11 7.5 – 11.5 //// //// / 11

12 –15 11.5 – 15.5 //// // 7

16 − 19 15.5 – 19.5 //// //// // 12

Total 40

Note: Relative frequencies and percentage can be calculated similarly as before.
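The steps of Example 2−3 can be sketched in Python as follows. This is a minimal sketch for whole-number data only; the function name `grouped_distribution` and its parameters are assumptions introduced for illustration, and the starting point of the first class is supplied by the user, as in the example.

```python
import math

# Leaf lengths (cm) from Example 2-3.
lengths = [19, 16, 13, 17, 7, 8, 4, 18, 10, 17, 18, 9, 12, 5, 9, 9, 16, 1, 8, 17,
           1, 10, 5, 9, 11, 15, 6, 14, 9, 17, 1, 12, 5, 16, 4, 16, 8, 15, 14, 17]


def grouped_distribution(data, num_classes, start):
    """Return [((lower limit, upper limit), frequency), ...] for whole-number data."""
    data_range = max(data) - min(data)
    width = math.ceil(data_range / num_classes)   # round the class width up
    classes = [(start + i * width, start + (i + 1) * width - 1)
               for i in range(num_classes)]
    if start > min(data) or classes[-1][1] < max(data):
        raise ValueError("the classes do not cover all the data")
    # Count how many data values fall inside each pair of class limits.
    return [((lo, hi), sum(lo <= x <= hi for x in data)) for lo, hi in classes]


leaf_table = grouped_distribution(lengths, num_classes=5, start=0)
for (lo, hi), f in leaf_table:
    print(f"{lo}-{hi}: {f}")
```

The check inside the function reflects the rule that the classes must be exhaustive: if the chosen starting point and width leave some data values uncovered, a different starting point or number of classes is needed.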

EXAMPLE 2−4

The table provides the distribution of the ages of new employees who joined a factory.

Age No. of employees


20 − 29 7

30 − 39 21

40 − 49 4

50 − 59 2

60 − 69 1

A. Obtain the class boundaries and class marks of the class intervals.
B. What is the upper class limit of the class 30 – 39?
C. What is the lower class boundary of the class 50 – 59?
D. What is the class mark of the class 40 – 49?



SOLUTION

A. The class boundaries and class marks are given in the following table:
Class interval Class boundary Class mark ( xm ) Frequency ( f )
20 − 29 19.5 – 29.5 24.5 7
30 − 39 29.5 – 39.5 34.5 21
40 − 49 39.5 − 49.5 44.5 4
50 − 59 49.5 – 59.5 54.5 2
60 − 69 59.5 – 69.5 64.5 1

B. 39
C. 49.5
D. 44.5

2.3 Graphical Presentation of Data


Once the raw data has been organized into frequency distribution tables, we turn to presenting the
data using statistical charts and graphs. They enable us to grasp the overall meaning of a complex
data set at a single glance.
We will first look at the graphical presentation of qualitative (categorical) data. Some of the
graphs/charts by which we can present qualitative data are:
• Bar graphs
• Pareto charts
• Time series graphs
• Pie charts

2.3.1 Bar Graphs


Bar graphs are used to represent qualitative or categorical data. Bar graphs can be drawn
using either vertical or horizontal bars.

A bar graph represents the data by using vertical or horizontal bars whose heights or lengths
represent the frequencies of the respective categories.

EXAMPLE 2−5

The given data represents the average amount of money spent by first year college students. Construct
a bar graph for the data.

Food $765

Clothing $443

Text Books $523

Technical Gadgets $855



SOLUTION

Step 1: Draw and label the x and y axes. For the vertical bar graph, place the frequency scale on the y
axis.

Step 2: Draw the vertical bars corresponding to the frequencies.

[Figure: Bar Graph. Vertical bars showing Amount (y-axis, 0 to 1,000) for each Type of spending (x-axis): Food, Clothing, Text Books, Technical Gadgets.]

2.3.2 Pareto Charts


When the variable displayed on the horizontal axis is qualitative or categorical, a Pareto chart can be
used to present the data.

A Pareto chart is used to present categorical data; the frequencies are displayed by the heights of
vertical bars, which are arranged in order from highest to lowest.

EXAMPLE 2−6

Construct a Pareto chart for the data given in Example 2−5.

SOLUTION

Step 1: Arrange the data from largest to smallest according to the frequency.

Technical Gadgets $855

Food $765

Text Books $523

Clothing $443

Step 2: Draw and label the x and y axes.

Step 3: Draw the vertical bars corresponding to the frequencies.

[Figure: Pareto Chart. Vertical bars showing Amount (y-axis, 0 to 900) for each Type of Spending (x-axis), ordered from highest to lowest: Technical Gadgets, Food, Text Books, Clothing.]

2.3.3 Time Series Graphs


Time series graphs are used to represent data collected over a period of time.

A time series graph represents data that occur over a specific period of time.

EXAMPLE 2−7

The data below show the number of athletes participating in a five-day athletics tournament organized
by the Oceania Sports Council. Construct a time series graph.

Day No. of Athletes


Monday 25

Tuesday 14

Wednesday 22

Thursday 36

Friday 43

SOLUTION

Step 1: Draw and label the x and y axes.


Step 2: Plot each point on the graph according to the data.
Step 3: Draw line segments connecting adjacent points.



[Figure: Time Series Graph. Line graph of No. of athletes (y-axis, 0 to 50) against Day (x-axis): Monday, Tuesday, Wednesday, Thursday, Friday.]

2.3.4 Pie Charts


Pie charts are widely used in statistics. The purpose of a pie chart is to show the relationship of
parts to the whole by visually comparing the sizes of the different sections.

A pie chart is a circle that is divided into sections or wedges according to the percentage of
frequencies in each category of the distribution.

EXAMPLE 2−8

This frequency distribution shows the drink preferences of people at a cocktail party. Construct a pie
chart for the data.
Response Frequency
Red Wine 77

Whiskey 48

Tribe 65

Total 190
SOLUTION

Step 1: Since there are 360° in a circle, the frequency for each class must be converted to degrees.

$$\text{Degrees} = \frac{f}{n} \times 360^\circ$$

Step 2: Each frequency must also be converted to a percentage.

$$\text{Percentage} = \frac{f}{n} \times 100\%$$



Step 3: Using a protractor and compass, draw the graph using the appropriate degree measures.
The table below shows the computations of degrees and percentages for each category.

Response   Frequency   Degrees                   Percentage
Red Wine   77          77/190 × 360° = 146°      77/190 × 100% = 41%
Whiskey    48          48/190 × 360° = 91°       48/190 × 100% = 25%
Tribe      65          65/190 × 360° = 123°      65/190 × 100% = 34%
Total      190         360°                      100%
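The conversions in Steps 1 and 2 can be reproduced with a short Python sketch. The function name `pie_sections` is an assumption made for this illustration; the frequencies are those of Example 2−8.

```python
# Frequencies from Example 2-8.
drinks = {"Red Wine": 77, "Whiskey": 48, "Tribe": 65}


def pie_sections(frequencies):
    """Return {category: (degrees, percentage)} using f/n * 360 and f/n * 100."""
    n = sum(frequencies.values())
    return {name: (f / n * 360, f / n * 100) for name, f in frequencies.items()}


sections = pie_sections(drinks)
for name, (degrees, percentage) in sections.items():
    # Round only at the end, for display.
    print(f"{name:<10}{degrees:>6.0f} degrees{percentage:>6.0f}%")
```

Because rounding is done only when displaying the results, the rounded degrees can be checked against 360 and the rounded percentages against 100 as a quick sanity check.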

[Figure: Pie Chart. Sections: Red Wine 41%, Tribe 34%, Whiskey 25%.]

We will now look at the graphical presentation of quantitative (numerical) data. Some of the
graphs/charts by which we can present quantitative data are:

• Histograms
• Frequency polygons
• Ogives or cumulative frequency graphs

2.3.5 Histograms
A histogram is the most commonly used graph to represent quantitative data. The horizontal axis
(x-axis) represents the data (or class boundaries) and the vertical axis (y-axis) represents the frequency.

A histogram is a graph that displays the data by using contiguous vertical bars of various heights to
represent the frequencies of the classes.



EXAMPLE 2−9

The data below represent the number of items rejected daily by a manufacturer because of defects,
recorded over the last 25 days. Construct a histogram.

Items Rejected Frequency


6−10 5
11−15 3
16−20 9
21−25 7
26−30 1

SOLUTION
Step 1: Draw and label the x and y axes.
Step 2: Represent the frequency on the y axis and the class boundaries on the x axis.
Step 3: Using the frequencies as the heights, draw vertical bars for each class.

[Figure: Histogram. Contiguous vertical bars showing frequency (y-axis) for the class boundaries 5.5-10.5, 10.5-15.5, 15.5-20.5, 20.5-25.5 and 25.5-30.5 of items rejected (x-axis).]

2.3.6 Frequency Polygons


Another way of representing the same data set is by using a frequency polygon.

A histogram is a graph that displays the data by using contiguous vertical bars of various heights
to represent the frequencies of the classes. The frequency polygon is a graph that displays the data
by using lines that connect points plotted for the frequencies at the midpoints of the classes. The
frequencies are represented by the heights of the points.



EXAMPLE 2−10

Construct a frequency polygon for the data given in Example 2−9.

SOLUTION

Step 1: Calculate the midpoints of each class.

Items Rejected Midpoints Frequency


6−10 8 5
11−15 13 3
16−20 18 9
21−25 23 7
26−30 28 1

Step 2: Draw the x and y axes. Label the x axis with the midpoints of each class and the y axis for the
frequencies.

Step 3: Using the midpoints for the x values and the frequencies as the y values, plot the points.

Step 4: Connect adjacent points with line segments.

[Figure: Frequency Polygon. Frequency (y-axis, 0 to 10) plotted against the midpoints 3, 8, 13, 18, 23, 28 and 33 of items rejected (x-axis).]

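Note: The frequency polygon should always touch the x-axis. To achieve this, the polygon is extended to
the midpoint of one extra class at each end (here 3 and 33), each with a frequency of 0.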



2.3.7 Ogives
An ogive is also called a cumulative frequency graph. This type of graph is used to represent the
cumulative frequencies for the classes. The cumulative frequency is the sum of the frequencies
accumulated up to the upper boundary of a class.

EXAMPLE 2−11

Construct an ogive for the data given in Example 2−9.

SOLUTION

Step 1: Calculate the cumulative frequency for each class.

Class Boundaries Frequency Cumulative Frequency


5.5−10.5 5 5
10.5−15.5 3 8
15.5−20.5 9 17
20.5−25.5 7 24
25.5−30.5 1 25

Step 2: Draw and label the x and y axes. The cumulative frequencies will go on the y-axis and the upper
class boundaries will go on the x-axis.

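Step 3: Using the upper class boundaries for the x values and the cumulative frequencies as the y values,
plot the points and connect adjacent points with line segments. The graph starts at the lower boundary
of the first class, 5.5, where the cumulative frequency is 0.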

[Figure: Ogive. Cumulative frequency (y-axis, 0 to 30) plotted against the class boundaries 5.5, 10.5, 15.5, 20.5, 25.5 and 30.5 of items rejected (x-axis).]
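The cumulative frequencies of Step 1 are running totals, which can be sketched in Python with `itertools.accumulate`. The variable names are assumptions made for this illustration, and starting the ogive at the first lower class boundary with a cumulative frequency of 0 is a common convention assumed here.

```python
from itertools import accumulate

# Upper class boundaries and frequencies from Example 2-9.
upper_boundaries = [10.5, 15.5, 20.5, 25.5, 30.5]
frequencies = [5, 3, 9, 7, 1]

# Running total of the frequencies: 5, 5+3, 5+3+9, ...
cumulative = list(accumulate(frequencies))

# Points to plot: start at the first lower boundary with cumulative frequency 0
# (an assumed convention), then one point per upper class boundary.
ogive_points = [(5.5, 0)] + list(zip(upper_boundaries, cumulative))

for x, y in ogive_points:
    print(x, y)
```

The last cumulative frequency always equals the total number of data values, which gives a simple check on the arithmetic.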



2.3.8 Relative Frequency Graphs
Another way of representing data is to use relative frequencies instead of frequencies. These types of
graphs are called relative frequency graphs. The graphs are similar to the ones that use raw
frequencies, but the values on the y axis are in terms of proportions.

EXAMPLE 2−12

Construct a histogram, frequency polygon, and an ogive using relative frequencies for the distribution of
the weights of 50 randomly selected ST130 students.

Class limits Frequency


30−39 5
40−49 10
50−59 18
60−69 12
70−79 5
Total 50

SOLUTION

Step 1: Calculate the class boundaries, class midpoints, relative frequency and cumulative relative
frequency.

Class limits   Frequency   Class boundaries   Midpoints   Relative frequency   Cumulative relative frequency
30−39          5           29.5−39.5          34.5        0.10                 0.10
40−49          10          39.5−49.5          44.5        0.20                 0.30
50−59          18          49.5−59.5          54.5        0.36                 0.66
60−69          12          59.5−69.5          64.5        0.24                 0.90
70−79          5           69.5−79.5          74.5        0.10                 1.00
Total          50

Step 2: Draw the graphs.

• The histogram is drawn using class boundaries on the x-axis and relative frequency on the y-axis.
• The frequency polygon is drawn using midpoints on the x-axis and relative frequency on the y-axis.
• The ogive is drawn using upper class boundaries on the x-axis and cumulative relative frequency
on the y-axis.



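[Figure: Histogram. Relative frequency (y-axis, 0 to 0.4) for the class boundaries 29.5-39.5, 39.5-49.5, 49.5-59.5, 59.5-69.5 and 69.5-79.5 of weights (x-axis).]

[Figure: Frequency Polygon. Relative frequency (y-axis, 0 to 0.4) plotted against the midpoints 24.5, 34.5, 44.5, 54.5, 64.5, 74.5 and 84.5 of weights (x-axis).]

[Figure: Ogive. Cumulative relative frequency (y-axis, 0 to 1.2) plotted against the class boundaries 29.5, 39.5, 49.5, 59.5, 69.5 and 79.5 of weights (x-axis).]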
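The relative and cumulative relative frequencies used in Example 2−12 can be computed with the following Python sketch. The variable names are assumptions made for this illustration.

```python
from itertools import accumulate

# Class limits and frequencies from Example 2-12.
weight_classes = [(30, 39), (40, 49), (50, 59), (60, 69), (70, 79)]
weight_freqs = [5, 10, 18, 12, 5]

n = sum(weight_freqs)                          # total frequency

# Relative frequency = class frequency / total frequency.
relative = [f / n for f in weight_freqs]

# Cumulative relative frequency = cumulative frequency / total frequency.
# Dividing the running totals (rather than adding up rounded proportions)
# avoids accumulating rounding error.
cumulative_relative = [c / n for c in accumulate(weight_freqs)]

for (lo, hi), r, c in zip(weight_classes, relative, cumulative_relative):
    print(f"{lo}-{hi}: {r:.2f} {c:.2f}")
```

The final cumulative relative frequency is always 1, just as the final cumulative frequency equals the total number of data values.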



2.4 Distribution Shapes
When one is describing data, it is important to be able to recognize the shape of the distribution. A
frequency distribution curve obtained from a histogram or frequency polygon can assume any one of a
large number of shapes. The most common of these shapes are:
• Symmetric;
• Skewed; and
• Uniform or rectangular.

The following graphs illustrate the general shape of the distribution:


[Figures: a positively skewed frequency curve; a negatively skewed frequency curve; a uniform or rectangular frequency curve; a symmetric frequency curve.]

Symmetric Frequency Curve: It is approximately identical on both sides of a line running through
the center. A bell-shaped distribution is a common example of this type.

Skewed Frequency Curve: A non-symmetrical frequency curve is known as a skewed curve.
When the peak of a curve is to the left and the longer tail is on the right side, the curve is said to be
right-skewed (positively skewed). When the curve has a longer tail on the left side and its peak on the
right side, it is said to be left-skewed (negatively skewed).

Uniform Frequency Curve: If a curve has the same frequency for each class, then it is said to
be a uniform or rectangular curve.



2.5 Stem and Leaf Plots
Stem and leaf plots are used to organize quantitative data when sorting and graphing are both important.
They have the advantage over grouped frequency distributions of retaining the actual data while showing
them in graphical form.

A stem and leaf plot is a data plot that uses part of the data value as the stem and part of the data
value as the leaf to form groups or classes.

EXAMPLE 2−13

At an outpatient-testing center, the number of cardiograms performed each day for 20 days is shown
below. Construct a stem-and-leaf plot for the data.

25 31 20 32 13 14 43 02 57 23
36 32 33 32 44 32 52 44 51 45

SOLUTION

To construct a stem-and-leaf plot for the above data, we follow these steps:

Step 1: Arrange the data in ascending order:

02, 13, 14, 20, 23, 25, 31, 32, 32, 32, 32, 33, 36, 43, 44, 44, 45, 51, 52, 57

Step 2: Separate the data according to the first digit.

02
13, 14
20, 23, 25
31, 32, 32, 32, 32, 33, 36
43, 44, 44, 45
51, 52, 57

Step 3: Using the unit (trailing) digit values as leaves, the corresponding stem-and-leaf plot is shown
below:

Stem   Leaf
0      2
1      3 4
2      0 3 5
3      1 2 2 2 2 3 6
4      3 4 4 5
5      1 2 7

By looking at this stem-and-leaf display, we can observe how the data values are distributed. For
example, the stem 3 has the highest frequency, followed by stems 4, 2, 5, 1, and 0.
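The three steps of Example 2−13 can be sketched in Python as below. The function name `stem_and_leaf` is an assumption made for this illustration, and the sketch handles non-negative whole numbers only, using the last digit as the leaf and the remaining digits as the stem. The value 02 is entered as 2.

```python
from collections import defaultdict

# Daily numbers of cardiograms from Example 2-13 (02 is written as 2).
cardiograms = [25, 31, 20, 32, 13, 14, 43, 2, 57, 23,
               36, 32, 33, 32, 44, 32, 52, 44, 51, 45]


def stem_and_leaf(data):
    """Map each stem (value // 10) to its sorted list of leaves (value % 10)."""
    plot = defaultdict(list)
    for value in sorted(data):            # Step 1: ascending order
        plot[value // 10].append(value % 10)   # Steps 2 and 3: split stem and leaf
    return dict(plot)


plot = stem_and_leaf(cardiograms)
for stem in sorted(plot):
    print(stem, "|", " ".join(str(leaf) for leaf in plot[stem]))
```

Note that this sketch omits stems that have no data values; when drawing a plot by hand, such stems are usually still listed so that gaps in the distribution remain visible.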



Note:
• A stem and leaf plot is a method to organize statistical data.
• When the data values are in the hundreds, such as 325, the stem is 32 and the leaf is 5.
• When you analyze a stem and leaf plot, look for peaks and gaps in the distribution. See if the
distribution is symmetric or skewed.
• Stem and leaf plots are part of the techniques called exploratory data analysis.

2.6 Summary
This chapter focused on statistical techniques for organizing and presenting data. The data was
organized using frequency distribution tables and presented using various graphs such as bar graphs,
Pareto charts, time series graphs, pie charts, histograms, frequency polygons and ogives. We also learnt
to recognize the shapes of frequency distributions and to construct stem and leaf plots.

EXERCISES

1. Twenty-five army inductees were given a blood test to determine their blood type. The following data
was obtained:

A B B AB O O O B AB B
B B O A O A O O O AB
AB A O B A

A. Construct a categorical frequency distribution for the data.


B. Calculate the relative frequencies and percentages for all categories.
C. What percentage of the elements in this sample belong to category A or O?
D. Construct a pie chart for the percentage distribution.
E. Draw a bar graph for the frequency distribution.

2. The amount of protein (in grams) for a variety of fast food sandwiches is reported here.

23 30 20 27 44 26 35 20 29 29
25 15 18 27 19 22 12 26 34 15

A. Construct a grouped frequency distribution using 5 classes.


B. Calculate the relative frequencies and percentages for all classes.
C. Construct a histogram and a frequency polygon for the frequency distribution.
D. Construct a cumulative frequency graph.

3. The results from a statistics exam are as follows:

75 66 77 66 64 73 91 65 59 86 61 86 61 58 70
77 80 58 95 78 62 79 83 54 52 45 82 48 67 55

A. Construct a stem-and-leaf display for these data.


B. What proportion of the marks is less than 70?
C. In which interval of 10s did most students score?
D. Is the distribution of marks in 10s symmetric or skewed? Explain.



CHAPTER 3:

DATA DESCRIPTION

Chapter 3: Data Description 34


Overview
This chapter discusses how data can be described using statistical methods. The concepts discussed in
this chapter are as follows: measure of central tendency; measure of variation; measure of position;
outliers; exploratory data analysis. The chapter concludes with a summary and a set of exercises.

Objectives
After completing this chapter, you should be able to:
1. Describe data, using measures of central tendencies, such as mean, median, mode and
midrange.
2. Describe data, using measures of variations, such as range, variance and standard deviation.
3. Identify the position of a data value in a data set, using various measures of position, such as
standard scores, percentiles, deciles and quartiles.
4. Check for outliers in a data set.
5. Use the techniques of exploratory data analysis, including boxplots to discover the nature of the
data.

3.1 Introduction
In Chapter 2, we saw how one can analyse raw data by organizing it into a frequency
distribution and then presenting the data using various graphs. Organizing and presenting alone are not
enough to describe data meaningfully, so we will now examine some statistical methods that can be used
to describe the data. The methods include measures of central tendency, measures of variation and
measures of position.

The measures of average, or measures of central tendency, are numerical measures that locate the
center of the dataset. Measures of central tendency include mean, median, mode, midrange and
weighted mean.

Knowing an average such as the mean, median or mode is not enough to describe the dataset entirely;
therefore, measures of variation or dispersion are also studied. Measures of variation or dispersion are
numerical measures that determine the spread of data values from the center. They include range,
variance and standard deviation.

In addition to measures of central tendency and measures of variation, there are measures of position or
location. They are used to locate the relative position of a data value in the dataset. Measures of position
include percentiles, deciles and quartiles. These measures are used extensively in psychology and
education, and sometimes they are referred to as norms.

3.2 Measures of Central Tendency


The measures of central tendencies (also known as measures of average) are numerical measures that
locate the center of the dataset. In other words, this measure is to find a single value, which enables us
to get an idea of the entire set of data. Measures of central tendency also enable us to facilitate
comparison between two or more sets of data.

The types of measures of central tendency that will be discussed in this section are mean, median, mode,
midrange and weighted mean.



Recall that when the population is small, there is no need to use samples since the entire population
can be used to gain information. However, if the population is large or infinite, we make use of samples
and then generalize from samples to populations. Therefore, it is important to know the following terms:

A parameter is a characteristic or measure obtained by using all the data values from an entire
population.

A statistic is a characteristic or measure obtained by using all the data values from a specific sample
chosen from a large population.

General Rounding Rule: When computations are done in statistics, the basic rounding rule is that,
rounding should not be done until the final answer is calculated. If rounding is done in every step along
the way, it tends to increase the difference between that answer and the exact one.

3.2.1 The Mean


The mean (arithmetic average), is calculated by adding all the data values and then dividing by the total
number of values. For example, the mean of the dataset 3, 2, 6, 5 and 4 is found by adding
3+2+6+5+4=20 and dividing by 5; hence the mean of the data is 20/5=4.

The symbol $\bar{X}$ represents the sample mean and $\mu$ represents the population mean.

Formulas to Compute Mean

We use the following formulas summarized in the table below to compute the mean:

             Raw data                       Ungrouped frequency distribution   Grouped frequency distribution
Sample       $\bar{X} = \frac{\sum X}{n}$   $\bar{X} = \frac{\sum fX}{n}$      $\bar{X} = \frac{\sum fX_m}{n}$
Population   $\mu = \frac{\sum X}{N}$       $\mu = \frac{\sum fX}{N}$          $\mu = \frac{\sum fX_m}{N}$

Where,
n is the sample size
N is the population size
f is the frequency of a class
$X_m$ is the midpoint of a class interval
$\sum X$ is the sum of all data values
$\sum fX$ is the sum of frequency multiplied with the data value of each class



EXAMPLE 3−1

The data given below represents the marks scored by a sample of 11 students selected from a particular
English class. Find the mean mark.

67, 89, 49, 55, 87, 79, 72, 69, 81, 52, 91

SOLUTION
Since the dataset represents a sample and is raw data, the mean is given by:

$$\bar{X} = \frac{\sum X}{n} = \frac{67 + 89 + \cdots + 91}{11} = \frac{791}{11} = 71.9$$

Hence, the mean mark is 71.9.

Rounding Rule for the Mean. The mean should be rounded to one more decimal place than occurs
in the raw data.
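The calculation in Example 3−1 can be checked with a few lines of Python. The variable names are assumptions made for this illustration; rounding is applied only to the final answer, in line with the general rounding rule.

```python
# Marks of the 11 students in Example 3-1.
marks = [67, 89, 49, 55, 87, 79, 72, 69, 81, 52, 91]

total = sum(marks)               # sum of all data values
n = len(marks)                   # sample size
mean_mark = total / n            # sample mean, unrounded

# The raw data are whole numbers, so the mean is reported to one decimal place.
print(total, n, round(mean_mark, 1))
```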

EXAMPLE 3−2

Using the frequency distribution as in Example 2-2 of Chapter 2, find the mean.

SOLUTION

Step 1: Make a table as shown.

Rating( X ) Frequency ( f ) fX

1 2

2 1

3 2

4 2

5 2

6 5

7 3

8 2

9 2

10 3

Total n = 24



Step 2: Multiply the frequency by the data value of each class and enter the results in the 3rd column.

Step 3: Find the sum of the values in the 3rd column. The completed table is shown below.

Rating ( X )   Frequency ( f )   fX
1              2                 2
2              1                 2
3              2                 6
4              2                 8
5              2                 10
6              5                 30
7              3                 21
8              2                 16
9              2                 18
10             3                 30
Total          n = 24            $\sum fX = 143$
Step 4: Divide the sum of the 3rd column by n to get the mean.

$$\bar{X} = \frac{\sum fX}{n} = \frac{143}{24} = 5.96$$
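The table-based calculation of Example 3−2 can be sketched in Python by storing each rating together with its frequency. The dictionary `ratings` and the other variable names are assumptions made for this illustration.

```python
# Rating -> frequency, from the ungrouped frequency distribution of Example 2-2.
ratings = {1: 2, 2: 1, 3: 2, 4: 2, 5: 2, 6: 5, 7: 3, 8: 2, 9: 2, 10: 3}

n = sum(ratings.values())                          # total frequency
sum_fx = sum(f * x for x, f in ratings.items())    # sum of the fX column
mean_rating = sum_fx / n

print(n, sum_fx, round(mean_rating, 2))
```

Each term `f * x` corresponds to one entry in the fX column of the table, so the intermediate sum can be compared directly with the column total.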

EXAMPLE 3−3

The following is the distribution of the number of fish caught by all 50 fishermen in a coastal area. Find
the mean number of fish caught by a fisherman.
No. of fish caught No. of fishermen
11 − 15 12

16 − 20 14

21 − 25 13

26 − 30 11



SOLUTION

Step 1: Make a table as shown.

No. of fish caught No. of fishermen ( f ) Midpoints ( X m ) fX m


11 − 15 12

16 − 20 14

21 − 25 13

26 − 30 11

n = 50

Step 2: Find the midpoint of each class and enter them in the 3rd column.

Step 3: For each class, multiply the frequency by the midpoint and enter the result in the 4th column.

Step 4: Find the sum of the values in the 4th column. The completed table is shown below.

No. of fish caught No. of fishermen ( f ) Midpoints ( X m ) fX m


11 − 15 12 13 156

16 − 20 14 18 252

21 − 25 13 23 299

26 − 30 11 28 308

N = 50   $\sum fX_m = 1015$

Step 5: Divide the sum of the 4th column by N to get the mean.

$$\mu = \frac{\sum fX_m}{N} = \frac{1015}{50} = 20.3$$
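For grouped data, the same idea applies with each class represented by its midpoint. The following Python sketch reproduces Example 3−3; the dictionary `fish_classes` and the other names are assumptions made for this illustration.

```python
# (lower limit, upper limit) -> number of fishermen, from Example 3-3.
fish_classes = {(11, 15): 12, (16, 20): 14, (21, 25): 13, (26, 30): 11}

N = sum(fish_classes.values())                     # population size

# Multiply each frequency by the class midpoint (lower + upper) / 2 and add up.
sum_fxm = sum(f * (lo + hi) / 2 for (lo, hi), f in fish_classes.items())
grouped_mean = sum_fxm / N

print(N, sum_fxm, grouped_mean)
```

Because the midpoints stand in for the actual data values, the result is an approximation to the mean of the original, ungrouped data.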

3.2.2 The Median


The median is the midpoint of the data set. To calculate the median, it is necessary to arrange the data
in order. The median can either be a specific value in the data set or can fall between two values.

The median is the midpoint of the data set when the data is arranged in order.



EXAMPLE 3−4

The numbers of comics purchased on a particular day by nine school students are given below.

3, 7, 10, 5, 9, 4, 11, 7, 2
Find the median.

SOLUTION

Step 1: Arrange the data in order


2, 3, 4, 5, 7, 7, 9, 10, 11

Step 2: Select the middle point. With nine values, the middle point is the 5th value.

2, 3, 4, 5, [7], 7, 9, 10, 11
Hence, the median is 7 comics.

EXAMPLE 3−5

The numbers of tropical cyclones in the Pacific over an 8-year period are as follows.

687, 576, 702, 405, 237, 899, 799, 907


Find the median.

SOLUTION

Step 1: Arrange the data in order.


237, 405, 576, 687, 702, 799, 899, 907

Step 2: Select the middle point.


237, 405, 576, 687, 702, 799, 899, 907
Since there are two values at the middle point, we add the two values and divide by 2 to find the median.

The median number of tropical cyclones is

\[ \frac{687 + 702}{2} = 694.5 \]
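The two cases in Examples 3−4 and 3−5 (odd and even number of values) can be sketched as follows. The function name `median` is chosen here for illustration only.

```python
def median(values):
    """Middle value of the ordered data; average of the two middle values if n is even."""
    ordered = sorted(values)          # Step 1: arrange the data in order
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:                    # odd n: a single middle value
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2   # even n: average the two middle values

comics = [3, 7, 10, 5, 9, 4, 11, 7, 2]                  # Example 3-4
cyclones = [687, 576, 702, 405, 237, 899, 799, 907]     # Example 3-5

print(median(comics), median(cyclones))
```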

EXAMPLE 3−6

Estimate the median of the data given in Example 3−3.

SOLUTION

Step 1: Find the class boundaries, cumulative frequency and cumulative percentage for each class.

\[ \text{cumulative percentage} = \frac{\text{cumulative frequency}}{\text{total frequency}} \times 100 \]

The table is shown below:

Class boundaries    Frequency    Cumulative frequency    Cumulative percentage
10.5 – 15.5         12           12                      (12/50) × 100 = 24
15.5 – 20.5         14           26                      (26/50) × 100 = 52
20.5 – 25.5         13           39                      78
25.5 – 30.5         11           50                      100
Total               50

Step 2: Using the upper class boundaries for the x values and the cumulative percentage as the y values,
plot the points. This type of ogive is called a Percentile Graph.

[Figure: Percentile Graph. Horizontal axis: no. of fish caught (class boundaries 10.5 to 30.5). Vertical axis: cumulative percentage (0 to 100).]

To estimate the median, find the x−value corresponding to the y-value of 50 from the percentile graph.
So the median is estimated to be 20.
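Reading a value off the percentile graph can be imitated numerically. The sketch below assumes that the plotted points are joined by straight lines and that the graph starts at (10.5, 0); both are assumptions made for illustration, and the function name `value_at_percent` is hypothetical.

```python
# Points of the percentile graph: class boundaries and cumulative percentages
boundaries = [10.5, 15.5, 20.5, 25.5, 30.5]
cum_pct = [0, 24, 52, 78, 100]

def value_at_percent(p, xs, ys):
    """x-value at cumulative percentage p, by linear interpolation between points."""
    for i in range(1, len(xs)):
        if ys[i - 1] <= p <= ys[i]:
            frac = (p - ys[i - 1]) / (ys[i] - ys[i - 1])
            return xs[i - 1] + frac * (xs[i] - xs[i - 1])
    raise ValueError("p must lie between 0 and 100")

print(round(value_at_percent(50, boundaries, cum_pct), 1))
```

The interpolated value is close to the estimate of 20 read from the graph; small differences are expected because a graph can only be read approximately.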

3.2.3 The Mode


The mode is the third measure of central tendency. It is the value that occurs most often in a data set.
Note:
• A data set that has only one value that occurs most often is said to be unimodal.
• If a data set has two values that occur most often, both values are considered to be the mode and the data set is said to be bimodal.
• If a data set has more than two values that occur most often, each value is used as the mode, and the data set is said to be multimodal.
• A data set in which no data value occurs more than once is said to have no mode.
• If data is grouped in class intervals, then the interval that has the highest frequency is called the modal class and its midpoint is called the crude mode.



EXAMPLE 3−7

Find the mode of the transfer fees of 9 professional soccer players for a specific year. The transfer fee in
millions of dollars is: 1.2, 12.0, 4.5, 6.1, 8.3, 4.5, 7.2, 11.0, 4.5

SOLUTION

Since $4.5 million occurred 3 times (most often), the mode is $4.5 million.

EXAMPLE 3−8

Find the mode for the following sets of data:


A. 40, 44, 57, 78, 48
B. 45, 55, 50, 45, 40, 55, 45, 55

SOLUTION

A. Since each value occurs only once, there is no mode. (Do not say that the mode is zero).
B. Since both 45 and 55 occur most often (3 times each), the modes are 45 and 55. This set of data
is said to be bimodal.
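The rules for the mode (including the "no mode" and bimodal cases) can be sketched with Python's standard `collections.Counter`. The function name `modes` is an illustrative choice, and returning an empty list for "no mode" is a convention assumed here.

```python
from collections import Counter

def modes(values):
    """Return all values that occur most often; an empty list means no mode."""
    if not values:
        return []
    counts = Counter(values)
    highest = max(counts.values())
    if highest == 1:                  # every value occurs only once: no mode
        return []
    return sorted(v for v, c in counts.items() if c == highest)

fees = [1.2, 12.0, 4.5, 6.1, 8.3, 4.5, 7.2, 11.0, 4.5]   # Example 3-7
set_a = [40, 44, 57, 78, 48]                              # Example 3-8 A
set_b = [45, 55, 50, 45, 40, 55, 45, 55]                  # Example 3-8 B

print(modes(fees), modes(set_a), modes(set_b))
```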

EXAMPLE 3−9

Find the mode of the frequency distribution in Example 3-3.

SOLUTION

The modal class is 16 – 20, as it has the highest frequency.

Note: In many cases, the measures of central tendency may have significantly different values. One has to be very cautious in using these measures.

EXAMPLE 3−10

A small company consists of the owner, a manager, a salesperson and two technicians, whose annual salaries are listed below. Find the mean, median and mode.

Staff Salary ($)


Owner 50,000
Manager 20,000
Salesperson 12,000
Technician 9,000
Technician 9,000



SOLUTION

Here the mean is $20,000, the median is $12,000 and the mode is $9,000. The mean is much higher than the median and mode because of the extremely high salary of the owner. In such situations, the median should be used as the measure of central tendency.

3.2.4 The Midrange


The midrange (MR) is a rough estimation of the middle. It is found by adding the lowest and the highest
values in the data set and dividing the result by 2. It can be affected by extreme values in the dataset.

\[ MR = \frac{\text{lowest value} + \text{highest value}}{2} \]

EXAMPLE 3−11

Find the midrange of the data in Example 3−10.

SOLUTION

\[ MR = \frac{9000 + 50000}{2} = 29{,}500 \]

Hence, the midrange is $29,500. The midrange is affected by the extreme value of $50,000 in the dataset.

Note: In statistics, several measures can be used for an average. The most common measures are mean, median, mode and midrange. Each has its own specific purpose and use. The median is a better measure when there are extreme values in the dataset (see Example 3−10).

3.2.5 The Weighted Mean


The weighted mean is used when we wish to place greater emphasis on some of the values in the data set. In such situations, it may not be suitable to calculate an ordinary mean. A mean that takes such additional factors into account is called the weighted mean.

The weighted mean of the data set x1, x2, …, xn with respective weights w1, w2, …, wn is given by

\[ \text{Weighted mean} = \frac{w_1 x_1 + w_2 x_2 + \cdots + w_n x_n}{w_1 + w_2 + \cdots + w_n} = \frac{\sum w_i x_i}{\sum w_i}. \]

The use of weighted mean is illustrated in the following example.



EXAMPLE 3−12

In ST130, a student obtained the following marks in the continuous assessment:

Mid-semester test (MST): 67%


Assignment 1: 88%
Assignment 2: 94%
Final exam: 75%

The mid-semester test has a weight of 20%, the assignments have a weight of 10% each and the final exam has a weight of 60%.

Calculate the final mark of the student.

SOLUTION

As in regulation, the weights for the results are in the following ratio:

MST : Assignment 1 : Assignment 2 : Final Exam = 20% : 10% : 10% : 60% = 2 : 1 : 1 : 6

For awarding the final result, we have to take this weighting into account:

\[ \text{Weighted mean} = \frac{2(67) + 1(88) + 1(94) + 6(75)}{2 + 1 + 1 + 6} = \frac{766}{10} = 76.6. \]

Therefore, the final mark is 77%.
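The weighted mean formula can be checked with a short sketch. The function name `weighted_mean` is chosen here for illustration; the marks and weights are those of Example 3−12.

```python
def weighted_mean(values, weights):
    """Sum of weight * value divided by the sum of the weights."""
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)

marks = [67, 88, 94, 75]      # MST, Assignment 1, Assignment 2, Final exam
weights = [2, 1, 1, 6]        # ratio 20% : 10% : 10% : 60%

print(round(weighted_mean(marks, weights), 1))
```

Because only the ratio of the weights matters, using 20, 10, 10 and 60 instead of 2, 1, 1 and 6 gives the same result.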

3.2.6 Relationships among Mean, Median and Mode


Knowing the values of the mean, median and mode gives some idea about the shape of a frequency distribution. We now discuss the relationships among the mean, median and mode for symmetric, positively skewed and negatively skewed distributions.

For a symmetric distribution with one peak, the values of the mean, median and mode are the same, and they lie at the center of the distribution.

Mode = Median = Mean



For a right skewed distribution, the value
of the mean is the largest, the mode is the
smallest, and the value of the median lies
between these two. Notice that the mode
always occurs at the peak point. The value of
the mean is the largest in this case because
it is sensitive to outliers that occur in the right
tail. These outliers pull the mean to the right.

Mode < Median < Mean

If a distribution is skewed to the left, the value of the mean is the smallest and the mode is the largest, with the value of the median lying between these two. In this case, the outliers in the left tail pull the mean to the left.

Mean < Median < Mode

3.3 Measures of Variation


The measures of variation (also known as measures of dispersion) are numerical measures to determine
the spread of the data values from the central tendencies. Many times the measures of central tendency
alone cannot describe the data.

EXAMPLE 3−13

I wish to test two brands of outdoor paint to see how long each will last before fading. The results (in
months) are shown. Find the mean and median of each group. (Assume Population)

Brand A Brand B
10 35
60 45
50 30
30 35
40 40
20 25

The mean and median for both brands of paint are 35 months. Since the mean and median are the same for both brands, we cannot conclude which paint is better using these measures of central tendency.

Therefore, to find out which paint is the better choice, a measure of variation is needed.

The types of measures of variation that will be discussed in this section are range, variance, and standard
deviation.

3.3.1 Range
The range is the simplest measure of variation and is defined as:

The range (R) is the highest value minus the lowest value in the data set. That is

R = Highest value – lowest value

EXAMPLE 3−14

Find the range for the two brands of paints given in Example 3−13.

SOLUTION

Brand A: The range R = 60 – 10 = 50 months.

Brand B: The range R = 45 – 25 = 20 months.

Since the range of Brand B is smaller, it can be concluded that Brand B is less variable (more reliable or a better choice) than Brand A.

Because the range is not a good measure of variability when there are extreme values in the dataset, statisticians use other measures called the variance and standard deviation.

3.3.2 The Variance and Standard Deviation


The variance is defined as the average of the squares of the deviations of each data value from the mean. It is denoted by σ² for the population variance and s² for the sample variance.

The corresponding formulas used to calculate these variances from raw data are

\[ \sigma^2 = \frac{\sum (X - \mu)^2}{N} \quad \text{and} \quad s^2 = \frac{\sum (X - \bar{X})^2}{n - 1}, \]

where

\[ \mu = \frac{\sum X}{N} \quad \text{and} \quad \bar{X} = \frac{\sum X}{n}. \]



The standard deviation is the most commonly used measure of dispersion. The value of the standard deviation tells how closely the values of a data set are clustered around the mean. The standard deviation is found by taking the square root of the variance. It is denoted by σ for the population standard deviation and s for the sample standard deviation.

EXAMPLE 3−15

Find the variance and standard deviation for Brand A paint data given in Example 3−13.

SOLUTION

Step 1: Find the mean.

\[ \mu = \frac{\sum X}{N} = \frac{210}{6} = 35 \]
Step 2: Subtract the mean from each data value and square each result. The completed table is shown
below.

Brand A (X)    (X − μ)²
10             (10 – 35)² = 625
60             (60 – 35)² = 625
50             225
30             25
40             25
20             225

Step 3: Find the sum of the 2nd column.

\[ \sum (X - \mu)^2 = 625 + 625 + 225 + 25 + 25 + 225 = 1750 \]

Step 4: Find the variance.

\[ \sigma^2 = \frac{\sum (X - \mu)^2}{N} = \frac{1750}{6} \approx 291.7 \]

Step 5: Find the standard deviation.

\[ \sigma = \sqrt{291.7} \approx 17.1 \]

Remarks:
1. The variance and standard deviation of Brand B paint are 41.7 and 6.5 respectively.
2. Since the standard deviation of Brand B is less, one can conclude that brand B is less variable (more
reliable or a better choice) than Brand A.



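3. There are shortcut formulas for computing the variance, which are summarized below. In each case the standard deviation is the square root of the variance.

Sample
Raw data:
\[ s^2 = \frac{\sum X^2 - \frac{(\sum X)^2}{n}}{n - 1} \]
Ungrouped frequency distribution:
\[ s^2 = \frac{\sum fX^2 - \frac{(\sum fX)^2}{n}}{n - 1} \]
Grouped frequency distribution:
\[ s^2 = \frac{\sum fX_m^2 - \frac{(\sum fX_m)^2}{n}}{n - 1} \]

Population
Raw data:
\[ \sigma^2 = \frac{\sum X^2 - \frac{(\sum X)^2}{N}}{N} \]
Ungrouped frequency distribution:
\[ \sigma^2 = \frac{\sum fX^2 - \frac{(\sum fX)^2}{N}}{N} \]
Grouped frequency distribution:
\[ \sigma^2 = \frac{\sum fX_m^2 - \frac{(\sum fX_m)^2}{N}}{N} \]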

Note: Always use the shortcut formulas to compute variance and standard deviation.
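That the definition and the raw-data shortcut formula agree can be checked numerically. The sketch below uses the Brand A data from Example 3−13; the function names are illustrative choices, not notation from the text.

```python
import math

def population_variance(values):
    """Definition: average of the squared deviations from the mean."""
    n = len(values)
    mu = sum(values) / n
    return sum((x - mu) ** 2 for x in values) / n

def population_variance_shortcut(values):
    """Shortcut: (sum of X^2 - (sum of X)^2 / N) / N."""
    n = len(values)
    return (sum(x * x for x in values) - sum(values) ** 2 / n) / n

def sample_variance_shortcut(values):
    """Same numerator as the population shortcut, divided by n - 1."""
    n = len(values)
    return (sum(x * x for x in values) - sum(values) ** 2 / n) / (n - 1)

brand_a = [10, 60, 50, 30, 40, 20]   # Brand A, Example 3-13

var_def = population_variance(brand_a)
var_short = population_variance_shortcut(brand_a)
print(round(var_def, 1), round(var_short, 1), round(math.sqrt(var_short), 1))
```

The shortcut needs only the two totals ΣX and ΣX², which is why it is convenient for hand calculation.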

EXAMPLE 3−16

Find the variance and standard deviation for Brand A paint data given in Example 3−13 using the shortcut
formula.

SOLUTION

Step 1: Find the sum of all the data values.

Step 2: Square each data value and enter them in the 2nd column

Step 3: Find the sum of 2nd column.

Brand A (X)    X²
10             100
60             3600
50             2500
30             900
40             1600
20             400
Σ X = 210      Σ X² = 9100

Step 4: Find the variance.

\[ \sigma^2 = \frac{9100 - \frac{210^2}{6}}{6} \approx 291.7 \]

Step 5: Find the standard deviation.

\[ \sigma = \sqrt{291.7} \approx 17.1 \]

EXAMPLE 3−17

Find the variance and standard deviation of the number of fish caught using the data in Example 3−3.

SOLUTION

Step 1: Make a table as shown.

No. of fish caught    No. of fishermen (f)    Midpoints (X_m)    f X_m    f X_m²


11 – 15 12

16 – 20 14

21 – 25 13

26 – 30 11

n = 50

Step 2: Find the midpoint of each class and enter them in the 3rd column.

Step 3: For each class, multiply the frequency by the midpoint and enter the product in the 4th column. Find the sum of the values in the 4th column.

Step 4: For each class, multiply the frequency with the square of the midpoints and enter them in the
5th column. Find the sum of the values in the 5th column. The completed table is shown below.

No. of fish caught    No. of fishermen (f)    Midpoints (X_m)    f X_m             f X_m²
11 – 15               12                      13                 12 × 13 = 156     12 × 13² = 2028
16 – 20               14                      18                 14 × 18 = 252     14 × 18² = 4536
21 – 25               13                      23                 299               6877
26 – 30               11                      28                 308               8624
                      n = 50                                     Σ f X_m = 1015    Σ f X_m² = 22065

Step 5: Find the variance.

\[ \sigma^2 = \frac{22065 - \frac{1015^2}{50}}{50} \approx 29.2 \]

Step 6: Find the standard deviation.

\[ \sigma = \sqrt{29.21} \approx 5.4 \]

3.3.3 Coefficient of Variation


When two or more datasets have the same units of measure, the variance or standard deviation can be used to compare the variability of the datasets. However, when the units of measure are different, the coefficient of variation is used to compare their variability.

The coefficient of variation, denoted by CV, is the standard deviation divided by the mean. The result is expressed as a percentage.

\[ \text{For population: } CV = \frac{\sigma}{\mu} \times 100\% \]

\[ \text{For sample: } CV = \frac{s}{\bar{X}} \times 100\% \]
EXAMPLE 3−18

The mean of the number of sales of airplane engines over a 6-month period is 92, and the standard
deviation is 5. The mean of the commissions earned is $5255, and the standard deviation is $770.
Compare the variations of the two.

SOLUTION

The coefficients of variation are:

\[ \text{For sales: } CV = \frac{\sigma}{\mu} \times 100\% = \frac{5}{92} \times 100\% \approx 5.4\% \]

\[ \text{For commission: } CV = \frac{\sigma}{\mu} \times 100\% = \frac{770}{5255} \times 100\% \approx 14.7\% \]

Since the coefficient of variation is larger for commissions, the commissions are more variable than the sales.
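The comparison in Example 3−18 can be reproduced with a one-line function. The name `coefficient_of_variation` is an illustrative choice.

```python
def coefficient_of_variation(std_dev, mean):
    """Standard deviation as a percentage of the mean."""
    return std_dev / mean * 100

cv_sales = coefficient_of_variation(5, 92)            # number of sales
cv_commission = coefficient_of_variation(770, 5255)   # commissions in dollars

print(round(cv_sales, 1), round(cv_commission, 1))
```

Because the units cancel in the ratio, the two percentages can be compared directly even though one variable is a count and the other is in dollars.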

3.4 Measures of Position


The measures of position (also known as measures of location) are the numerical measures to determine
the relative position of a data value in a data set.

The measures of position that will be discussed in this section are standard scores, percentiles, deciles and quartiles.



3.4.1 Standard Scores
There is an old saying, “You can’t compare apples and oranges.” However, with the use of statistics, it can be done to some extent. Suppose that a student scored 90 in a mathematics test and 45 in an English test. Direct comparison of these raw scores is impossible, since the exams might not be equivalent in terms of number of questions, value of each question, and so on. However, the scores can be compared relative to a standard that is common to both. This comparison uses the mean and standard deviation and is called a standard score or z score.
A standard score or z score tells how many standard deviations a data value is above or below the mean
for a specific distribution of values. If the standard score is zero, then the data value is the same as the
mean.

A z score or standard score for a value is obtained by subtracting the mean from the value and dividing the result by the standard deviation, i.e.

\[ \text{For population: } z = \frac{X - \mu}{\sigma} \]

\[ \text{For sample: } z = \frac{X - \bar{X}}{s} \]

EXAMPLE 3−19

A student scored 90 on a Maths test that had a mean of 52 and a standard deviation of 10; the student also scored 45 on an English test with a mean of 35 and a standard deviation of 5. Compare the student's relative positions on the two tests.

SOLUTION

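Step 1: Find the z scores.

\[ \text{For Maths: } z = \frac{X - \bar{X}}{s} = \frac{90 - 52}{10} = 3.8 \]

\[ \text{For English: } z = \frac{X - \bar{X}}{s} = \frac{45 - 35}{5} = 2.0 \]

Since the z score for the Maths test is higher than the z score for the English test, the student's relative position is higher on the Maths test.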
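The z score calculation of Example 3−19 can be sketched as follows. The function name `z_score` is chosen here for illustration.

```python
def z_score(x, mean, std_dev):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / std_dev

z_maths = z_score(90, 52, 10)
z_english = z_score(45, 35, 5)

print(z_maths, z_english)
```

A value equal to the mean gives a z score of zero, in line with the definition above.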

3.4.2 Percentiles
Percentiles are position measures used in educational and health-related fields to indicate the position of
an individual in a group.

Percentiles are data values that divide the dataset into 100 equal parts, where the dataset is arranged in ascending order. Each set of observations has 99 percentiles, which are denoted by P1, P2, …, P99.



The following figure describes the positions of the 99 percentiles.

[Figure: Positions of the percentiles P1, P2, P3, …, P97, P98, P99. Each of the 100 portions contains 1% of the observations of a data set arranged in increasing order.]

Remarks:
1. P20 is called the 20th percentile, which indicates that 20% of the scores fall below P20.
2. P50 is called the 50th percentile, which indicates that 50% of the scores fall below P50. P50 = median.

Steps to Compute Percentile of Raw data


Step 1: Arrange the data from lowest to highest (ascending order).

Step 2: Find the kth percentile (Pk).

\[ P_k = \text{value of the } \left( \frac{kn}{100} \right)\text{th term} \]

where k is the percentile number and n is the sample size.

Note:
1. To calculate quartiles and deciles of raw data, convert them to percentiles and use the same steps.
2. To estimate percentiles, deciles and quartiles of grouped data, use a Percentile Graph.

Percentile Rank
We can calculate the percentile rank for a particular value x of a data set by using the formula:

\[ \text{Percentile rank of } x = \frac{(\text{number of values less than } x) + 0.5}{\text{total number of values}} \times 100\% \]
Note:
1. A percentile is a value in the data set.
2. The percentile rank of a score indicates what percent of data lies below the score.



3.4.3 Deciles
Deciles are data values that divide the dataset into 10 equal parts, where the dataset is arranged in ascending order. Each set of observations has 9 deciles, which are denoted by D1, D2, …, D9.

The following figure describes the positions of the 9 deciles.

[Figure: Positions of the deciles D1, D2, D3, …, D7, D8, D9. Each of the 10 portions contains 10% of the observations of a data set arranged in increasing order.]

Remarks:
1. D4 is called the 4th decile, which indicates that 40% of the scores fall below D4.
2. D5 is called the 5th decile, which indicates that 50% of the scores fall below D5.
3. P50 = D5 = median.
4. D1 = P10; D2 = P20; D3 = P30; …; D9 = P90.

3.4.4 Quartiles
Quartiles are data values that divide the dataset into 4 equal parts, where the dataset is arranged in ascending order. Each set of observations has 3 quartiles, which are denoted by Q1, Q2 and Q3.

The following figure describes the positions of the 3 quartiles.

[Figure: Positions of the quartiles Q1, Q2 and Q3. Each of the 4 portions contains 25% of the observations of a data set arranged in increasing order.]

Remarks:
1. Q1 is called the 1st quartile (or lower quartile), which indicates that 25% of the scores fall below Q1.
2. Q3 is called the 3rd quartile (or upper quartile), which indicates that 75% of the scores fall below Q3.
3. Q1 = P25; Q2 = P50; Q3 = P75.
4. Q2 = D5 = P50 = Median.



EXAMPLE 3−20

The following are the test scores of 12 students in a statistics class:

70, 77, 65, 56, 99, 62, 79, 73, 85, 87, 92, 82

Calculate the following:

1. P80 and interpret its value.


2. D6 .
3. Q1 and Q3 .
4. Percentile rank for the score 92.

SOLUTION

Arrange the data from lowest to highest (ascending order).

56, 62, 65, 70, 73, 77, 79, 82, 85, 87, 92, 99

1. P80 is obtained by:

\[ P_{80} = \frac{80(12)}{100}\text{th term} = 9.6\text{th term} \]

The value of the 9.6th term can be approximated by the 10th term in the ranked data. Therefore, P80 = 87.
Hence, approximately 80% of the scores are below 87 in the given data.

2. D6 (or P60) is obtained by:

\[ P_{60} = \frac{60(12)}{100}\text{th term} = 7.2\text{th term} \]

The value of the 7.2th term can be approximated by the 8th term in the ranked data. Therefore, D6 = 82.
Hence, approximately 60% of the scores are below 82 in the given data.

3. Q1 (or P25) is obtained by:

\[ P_{25} = \frac{25(12)}{100}\text{th term} = 3\text{rd term} \]

The value of the 3rd term can be approximated by the average of the 3rd and 4th terms in the ranked data. Therefore,

\[ Q_1 = \frac{65 + 70}{2} = 67.5 \]

Q3 (or P75) is obtained by:

\[ P_{75} = \frac{75(12)}{100}\text{th term} = 9\text{th term} \]

The value of the 9th term can be approximated by the average of the 9th and 10th terms in the ranked data. Therefore,

\[ Q_3 = \frac{85 + 87}{2} = 86. \]

4. The percentile rank of 92 is

\[ \frac{10 + 0.5}{12} \times 100\% = 87.5\%. \]

Hence, approximately 87.5% of the scores are below 92 in the given data.
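The calculations of Example 3−20 can be sketched in Python. The rounding rule used below is inferred from the worked example rather than stated in the steps: when kn/100 is not a whole number the next higher term is taken, and when it is a whole number the average of that term and the next one is taken. The function names are illustrative, and other textbooks and software may use different percentile conventions.

```python
def percentile(values, k):
    """k-th percentile (0 < k < 100) using the rule followed in Example 3-20."""
    ordered = sorted(values)
    n = len(ordered)
    whole, remainder = divmod(k * n, 100)    # position kn/100 as whole part and remainder
    if remainder == 0:
        # Whole-number position: average that term and the next one
        return (ordered[whole - 1] + ordered[whole]) / 2
    # Fractional position: take the next higher term
    return ordered[whole]

def percentile_rank(values, x):
    """(Number of values less than x + 0.5) / total number of values * 100."""
    below = sum(1 for v in values if v < x)
    return (below + 0.5) / len(values) * 100

scores = [70, 77, 65, 56, 99, 62, 79, 73, 85, 87, 92, 82]

print(percentile(scores, 80), percentile(scores, 60))
print(percentile(scores, 25), percentile(scores, 75))
print(percentile_rank(scores, 92))
```

Deciles and quartiles are obtained by converting them to percentiles first, for example D6 as `percentile(scores, 60)` and Q1 as `percentile(scores, 25)`.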

EXAMPLE 3−21

Estimate the following from the data given in Example 3−3.

1. P20 .
2. Percentile rank for the score 26.

SOLUTION

Using the percentile graph plotted before,

[Figure: Percentile Graph. Horizontal axis: no. of fish caught (class boundaries 10.5 to 30.5). Vertical axis: cumulative percentage (0 to 100).]



1. Reading the x-value that corresponds to the y-value 20 gives P20 ≈ 14.
2. Reading the y-value that corresponds to the x-value 26 gives a percentile rank of approximately 81 for the score 26.

3.4.5 Other Measures of Variation


The variance and standard deviation are regarded as the best and the most powerful measures of dispersion. One of the drawbacks of these measures is that they are influenced by extreme observations called outliers. Thus, when there are outliers in a dataset, many statisticians think that the median should be used as the measure of central tendency, and that other measures of dispersion, namely the interquartile range or the quartile deviation, should be used to describe the variability.

The interquartile range is the difference between the upper quartile and the lower quartile. That is,

\[ \text{Interquartile range (IQR)} = Q_3 - Q_1 \]

The quartile deviation is half of the difference between the upper quartile and the lower quartile. That is,

\[ \text{Quartile deviation (QD)} = \frac{Q_3 - Q_1}{2} \]

EXAMPLE 3−22

Find the interquartile range and the quartile deviation for the given data in Example 3−20.

SOLUTION

From Example 3−20, we obtain Q1 = 67.5 and Q3 = 86. Therefore,

\[ \text{Interquartile range} = Q_3 - Q_1 = 86 - 67.5 = 18.5 \]

and

\[ \text{Quartile deviation} = \frac{Q_3 - Q_1}{2} = \frac{86 - 67.5}{2} = 9.25 \]



3.5 Outliers
We already know that values that are very small (or extreme low) or very large (or extreme high) relative
to the majority of the values in a data set are known as outliers. We have seen that outliers strongly affect
the mean, standard deviation and some other measures as well. Therefore, it is important to identify
outliers in the dataset so that we use appropriate measures when outliers are present in the dataset.

An outlier is an extremely high or an extremely low data value when compared with the rest of the
data values.

How does an outlier occur?


There are several reasons why outliers may occur. The data value may have resulted from a:
• Measurement or observational error. That is, the researcher measured the variable incorrectly.
• Recording error. That is, it may have been written or typed incorrectly.
• Subject that is not in the defined population.

Procedure for Identifying Outliers


There are several ways to check a dataset for outliers. A good rule of thumb for detecting outliers is as follows:

Step 1: Arrange the data in ascending order and find Q1 and Q3.

Step 2: Find the interquartile range: IQR = Q3 − Q1.

Step 3: Find the interval: Q1 − 1.5 × IQR ≤ x ≤ Q3 + 1.5 × IQR.

Step 4: Check the data set for any data values x that fall outside the interval. Those values are outliers.

EXAMPLE 3−23

Check the following data set for outliers.

70, 5, 12, 6, 15, 13, 18, 30

SOLUTION

The data value 70 is a suspected outlier. Using the procedure given above we have:

Step 1: The data in ascending order is


5, 6, 12, 13, 15, 18, 30, 70

Using the procedure taught before, Q1 = 9 and Q3 = 24.

Step 2: The interquartile range is IQR = 24 – 9 = 15.

Step 3: The interval is 9 − 1.5 × 15 ≤ x ≤ 24 + 1.5 × 15, that is, −13.5 ≤ x ≤ 46.5.



Step 4: Check the data set for any data values that fall outside the interval from −13.5 to 46.5. Since the
data value 70 is outside this interval, it can be considered an outlier.
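Steps 2 to 4 of the procedure can be sketched as below, taking the quartiles Q1 = 9 and Q3 = 24 from the solution as given. The function name `outlier_interval` is an illustrative choice.

```python
def outlier_interval(q1, q3):
    """Interval from Q1 - 1.5 * IQR to Q3 + 1.5 * IQR."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

data = [70, 5, 12, 6, 15, 13, 18, 30]    # Example 3-23
low, high = outlier_interval(9, 24)       # quartiles taken from the solution

# Any value outside the interval is flagged as an outlier
outliers = [x for x in data if x < low or x > high]

print(low, high, outliers)
```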

3.6 Exploratory Data Analysis (EDA)


In traditional statistics, data are organized by using a frequency distribution and various graphs are
constructed to determine the shape or nature of the distribution. Exploratory Data Analysis (EDA) is
the process of using graphical and descriptive statistical techniques (like median, IQR) to learn about the
structure of a dataset.

In EDA,
• Data can be organised using a stem and leaf plot.
• The measure of central tendency used is the median.
• The measure of variation used is the interquartile range.
• Data are represented graphically using a box-plot.

A box-plot is a graph that is used to determine the nature and shape of the distribution in EDA. It is
obtained by drawing a horizontal line from the minimum data value to Q1 , drawing a horizontal line from
Q3 to the maximum data value, and drawing a box whose vertical sides pass through Q1 and Q3 with
a vertical line inside the box passing through the median.

Information obtained from a Box-plot


a. If the median is near the center of the box or the lines are about the same length, the distribution is approximately symmetric.
b. If the median is to the left of the center of the box or the right line is longer than the left line, the distribution is positively skewed.
c. If the median falls to the right of the center of the box or the left line is longer than the right line, the distribution is negatively skewed.

EXAMPLE 3−24

Construct a box-plot for the data given below.

16, 18, 12, 11, 8, 13, 4, 3, 9, 20

SOLUTION

Step 1: The Five-Number Summary (Note: The data should be arranged in ascending order first)
1. The lowest value is 3;
2. Q1  8 ;
3. The median is 11.5;
4. Q3  16 ;
5. The highest value is 20;

Step 2: Draw a horizontal axis with a suitable scale.



Step 3: Draw a horizontal line from the minimum data value to Q1 , then draw a horizontal line from Q3
to the maximum data value, and then draw a box whose vertical sides pass through Q1 and Q3 with a
vertical line inside the box passing through the median.

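Therefore, the box-plot is given below:

[Figure: Box-plot on a horizontal axis from 0 to 22, with lines from 3 to 8 and from 16 to 20, a box from Q1 = 8 to Q3 = 16, and a vertical line inside the box at the median 11.5.]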

The distribution is somewhat symmetric.

3.7 Summary
This chapter discussed statistical techniques for describing data: measures of central tendency, measures of variation and measures of position. The measures of central tendency (mean, median, mode and midrange) locate the center of the data set. The measures of variation (range, variance and standard deviation) gauge the spread of the data values. The measures of position (standard score, percentile, decile and quartile) locate the position of a data value within the data set. Further, the chapter explained how to detect outliers in a data set and how to construct a box-plot.

EXERCISES

1. The cash compensations received in 2009 by the highest-paid executives of 12 international


companies (in $000s) were as follows:

2215 1888 1477 1059 977 956


947 924 899 856 856 803

A. Compute the mean, median, mode and the standard deviation.


B. Calculate the values of three quartiles, 40th percentile and the percentile rank of 956.
C. Check for outliers in the data.
D. Construct a box-plot and use it to comment on the shape of the distribution.

2. A survey of all the 110 firms in a small state was carried out to find the number of people employed
at each. The results are shown in the following table.

Number of Employees 1 – 10 11 – 20 21 – 30 31 – 40 41 – 50
Frequency 32 34 14 12 18



A. Approximate the mean, the mode and the median of the number of people employed at each
firm.
B. Calculate the variance and standard deviation.

3. Suppose an instructor gives two exams and a final exam, assigning the final exam a weight twice
that of each of the other exams. Find the weighted mean for a student who scores 73 and 67 on the
first two exams and 85 on the final exam.

4. An analysis of monthly wages paid to the workers of firm A and B belonging to the same industry
gives the following results:

Firm A Firm B
Number of Workers 100 200
Average monthly wage $196 $185
Variance of distribution of wages $81 $144

A. Which firm, A or B has a larger wage bill?


B. In which firm, A or B is there greater variability among individual wages?



CHAPTER 4:

PROBABILITY (PART I)

Chapter 4: Probability (Part I) 61


Overview
This chapter introduces the concepts of probability. It explains the basic terms and concepts such as probability; probability experiments; sample space; event; complement, intersection and union of events; classical, empirical and subjective probability; the addition rule and mutually exclusive events. The chapter concludes with a summary and a set of exercises.

Objectives
After completing this chapter, you should be able to:
1. Find the sample space of probabilistic experiments.
2. Calculate the probability using classical and empirical approach.
3. Calculate the probability using the addition rule.

4.1 Introduction
This section introduces probability: where it can be used and how it is defined. It also outlines the other concepts that you will learn in this chapter.

What is Probability?
No doubt, you are familiar with terms such as probability, chance and likelihood. They are often used interchangeably. Statements that involve probability are:

• The weather forecaster announces that there is an 80 percent chance of rain during a soccer match.
• The probability that a certain brand of computer will survive 100,000 hours of operation without repair is 0.75.
• What are the chances of Fiji winning the IRB series this year?

Probability, which is an important part of statistics, is a number that describes the chance that something
will happen. A more formal definition is:

Probability is the numerical measure of the likelihood that a specific event will occur.

Many people are familiar with probability from observing or playing various games of chance using cards, coins and dice, or in lotteries. In addition to being used in games of chance, probability theory is often used for explaining many real-world phenomena and helps us in decision-making in the fields of insurance, investments, weather forecasting and various other areas. Finally, probability theory is the basis of inferential statistics, which we will discuss in later chapters in this course.

In this chapter, the basic concepts of probability are explained. These concepts include probability experiments, sample spaces, outcomes, events and many others. Further, this chapter also explains the three basic interpretations of probability, mutually exclusive events and the addition rules of probability.



4.2 Basic Concepts in Probability
This section explains some basic concepts of probability as follows:
• An experiment is any activity that yields a result or an outcome.
• Probability (or random) experiments are those where the outcome cannot be predicted in advance. For example, if we toss a coin, the outcome may be either 'head' or 'tail', but we cannot predict in advance which one will occur. Other examples of probability experiments are rolling a die, drawing a card from a deck, a couple having a child, and answering a true/false question.
• An outcome is the result of a single trial of a probability experiment.
• A trial means flipping a coin once, rolling one die once or the like.
• A sample space, denoted by S, is the set of all possible outcomes of a probability experiment.

EXAMPLE 4−1

The sample spaces for the following probability experiments are:

Experiment                                      Sample Space
Tossing a coin once                             S = {H, T}
Rolling a die once                              S = {1, 2, 3, 4, 5, 6}
Answering a true-false question                 S = {True, False}
Playing a lottery                               S = {Win, Lose}
Tossing two coins or tossing a coin two times   S = {HH, HT, TH, TT}
Tossing a coin and then rolling a die           S = {H1, H2, H3, H4, H5, H6, T1, T2, T3, T4, T5, T6}



EXAMPLE 4−2

Find the sample space for rolling two dice.

SOLUTION

Since each die can land in six different ways, and two dice are rolled, the sample space can be presented
by a rectangular array as follows:

Die 2
Die 1
1 2 3 4 5 6

1 (1,1) (1,2) (1,3) (1,4) (1,5) (1,6)

2 (2,1) (2,2) (2,3) (2,4) (2,5) (2,6)

3 (3,1) (3,2) (3,3) (3,4) (3,5) (3,6)

4 (4,1) (4,2) (4,3) (4,4) (4,5) (4,6)

5 (5,1) (5,2) (5,3) (5,4) (5,5) (5,6)

6 (6,1) (6,2) (6,3) (6,4) (6,5) (6,6)
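The rectangular array above can also be generated programmatically. The sketch below uses `itertools.product` from the Python standard library; the variable names are illustrative.

```python
from itertools import product

faces = range(1, 7)                               # the six faces of a die
sample_space = list(product(faces, repeat=2))     # ordered pairs (die 1, die 2)

# Print the sample space one row per value of die 1, as in the array
for first in faces:
    print([pair for pair in sample_space if pair[0] == first])

print(len(sample_space))
```

Each die can land in 6 ways, so the array has 6 × 6 = 36 outcomes.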

EXAMPLE 4−3

Find the sample space for drawing one card from an ordinary deck of cards.

SOLUTION

There are 52 cards in an ordinary deck so the sample space is:

Spades: 2, 3, 4, 5, 6, 7, 8, 9, 10, jack, queen, king, ace


Clubs: 2, 3, 4, 5, 6, 7, 8, 9, 10, jack, queen, king, ace
Diamonds: 2, 3, 4, 5, 6, 7, 8, 9, 10, jack, queen, king, ace
Hearts: 2, 3, 4, 5, 6, 7, 8, 9, 10, jack, queen, king, ace

A tree diagram is a device consisting of line segments emanating from a starting point and also from
the outcome point. It is used to determine the sample space in a systematic way.



EXAMPLE 4−4

Use a tree diagram to find the sample space for a family of three children.

SOLUTION

Since there are two possibilities (boy or a girl) for the first child, draw two branches from a starting point
and label one B and the other G. Then if the first child is a boy, there are two possibilities for the second
child (boy or a girl), so draw two branches from B and label one B and the other G. Do the same if the
first child is a girl. Follow the same procedure for the third child. The completed tree diagram is shown
below. To find the outcomes for the sample space, trace through all possible branches.

1st child    2nd child    3rd child    Outcome
B            B            B            BBB
                          G            BBG
             G            B            BGB
                          G            BGG
G            B            B            GBB
                          G            GBG
             G            B            GGB
                          G            GGG
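
Tracing every branch of the tree is the same as listing every sequence of three choices from {B, G}. A short Python sketch of that idea (the variable name `outcomes` is illustrative):

```python
from itertools import product

# Each child is either a boy (B) or a girl (G); three children give 2 * 2 * 2 outcomes.
outcomes = ["".join(children) for children in product("BG", repeat=3)]

print(outcomes)
# ['BBB', 'BBG', 'BGB', 'BGG', 'GBB', 'GBG', 'GGB', 'GGG']
```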



4.2.1 Event
An event is a set of outcomes from the sample space, that is, a subset of S. Events are usually denoted
by capital letters.

For example, in the experiment of tossing two coins, the sample space is S = {HH, HT, TH, TT}.
We can let E be the event of getting 2 heads, that is, E = {HH}, and F be the event of getting no heads,
that is, F = {TT}.
 A simple event is an event with only one sample point.
 A compound event is an event with more than one sample point.
 An event that does not contain any sample point is called an impossible event (or null event
or empty event). It is denoted by Ø.
 An event that contains all the sample points of the sample space is called a sure (or certain)
event.

EXAMPLE 4−5

In an experiment of throwing a die, classify the events below as simple, compound, sure or impossible
event.

A. Getting a six
B. Getting even faces
C. Getting even or odd faces
D. Getting a seven

SOLUTION

A. Simple
B. Compound
C. Compound and Sure
D. Impossible

A Venn diagram uses circles to represent sets, in which the relations between the sets are indicated
by the arrangement of the circles.

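For example, out of forty students at USP, 14 are taking English and 29 are taking chemistry. If five
students are in both classes, the Venn diagram consists of two overlapping circles: 9 students in English
only, 5 in the overlap (both classes), 24 in chemistry only, and 2 outside both circles (taking neither).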



4.2.2 Complement of an Event
The complement of an event A, denoted by Ā (or A’), is the event that contains all the sample points of S
except those belonging to A. It can be represented using a Venn diagram as follows:

[Venn diagram: a circle A inside the rectangle S; the region of S outside the circle is Ā.]

EXAMPLE 4−6

Consider the experiment of rolling a die, where the sample space is S = {1, 2, 3, 4, 5, 6}. If A = {1, 3, 5},
then Ā = {2, 4, 6}.

4.2.3 Intersection of Two Events


The intersection of two events A and B, denoted by A ∩ B, is the event containing all sample points that
are common to A and B. It can be represented using a Venn diagram as follows:

[Venn diagram: two overlapping circles A and B inside the rectangle S; the overlap of the circles is A ∩ B.]

EXAMPLE 4−7

In the above example, A = {1, 3, 5}. If B = {1, 3, 6}, then A ∩ B = {1, 3}.

4.2.4 Union of Two Events


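The union of two events A and B, denoted by A ∪ B, is the event containing all sample points that are in
A, in B, or in both A and B. It can be represented using a Venn diagram as follows:

[Venn diagram: two overlapping circles A and B inside the rectangle S; the whole region covered by the two circles is A ∪ B.]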

EXAMPLE 4−8

In the above example, A = {1, 3, 5}. If B = {1, 3, 6}, then A ∪ B = {1, 3, 5, 6}.
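
Complement, intersection and union correspond directly to set operations. As an illustration only, the die example of this section can be checked with Python's built-in `set` type (the variable names follow the text):

```python
S = {1, 2, 3, 4, 5, 6}   # sample space for one roll of a die
A = {1, 3, 5}
B = {1, 3, 6}

complement_A = S - A     # all sample points of S not in A
intersection = A & B     # sample points common to A and B
union = A | B            # sample points in A, in B, or in both

print(sorted(complement_A))   # [2, 4, 6]
print(sorted(intersection))   # [1, 3]
print(sorted(union))          # [1, 3, 5, 6]
```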



4.3 Interpretations of Probability
Before we discuss the interpretations of probability, let us first look at some basic probability
rules.

Basic Probability Rules


1. The probability of any event A lies between 0 and 1 inclusive, that is, 0 ≤ P (A) ≤ 1.
2. If an event A cannot occur, its probability is 0.
3. If an event A is sure or certain, its probability is 1.
4. The sum of the probabilities of all the outcomes in the sample space is 1.
5. The probability of the complement of an event A is given by: P (Ā) = 1 − P (A).

The three basic interpretations of probabilities are:


 Classical probability,
 Empirical or relative frequency probability, and,
 Subjective probability.

4.3.1 Classical Probability


Classical probability uses sample spaces rather than actually performing the experiment to determine the
probability of an event. It assumes that all outcomes in the sample space are equally likely to occur. For
example, when a die is rolled, each outcome has the same probability of occurring. Since there are six
outcomes, each outcome has a probability of 1/6.

Formula for Classical Probability


The probability of any event E is:

$$P(E) = \frac{\text{No. of outcomes in } E}{\text{No. of outcomes in } S} = \frac{n(E)}{n(S)}.$$

EXAMPLE 4−9

A die is rolled once. Find the probability of the following events:


1. A: occurrence of an odd number,
2. B: a number less than 5 occurs,
3. C: a number more than 3 or an odd number,
4. A’,
5. B ∩ C.



SOLUTION

Here, S = {1, 2, 3, 4, 5, 6}.

1. A = {1, 3, 5}. Therefore, P (A) = 3/6 = 1/2.
2. B = {1, 2, 3, 4}. Therefore, P (B) = 4/6 = 2/3.
3. C = {1, 3, 4, 5, 6}. Therefore, P (C) = 5/6.
4. A’ = {2, 4, 6}. Therefore, P (A’) = 3/6 = 1/2, or P (A’) = 1 − 1/2 = 1/2.
5. B ∩ C = {1, 3, 4}. Therefore, P (B ∩ C) = 3/6 = 1/2.
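
The classical formula P(E) = n(E)/n(S) is easy to check by counting. The sketch below, written in Python as an illustration, assumes equally likely outcomes; the helper name `prob` is hypothetical and `Fraction` is used so the answers stay as exact fractions:

```python
from fractions import Fraction

S = {1, 2, 3, 4, 5, 6}   # sample space for one roll of a die

def prob(event):
    """Classical probability n(E) / n(S), assuming equally likely outcomes."""
    return Fraction(len(event & S), len(S))

A = {x for x in S if x % 2 == 1}   # an odd number occurs
B = {x for x in S if x < 5}        # a number less than 5 occurs
C = {x for x in S if x > 3} | A    # a number more than 3 or an odd number

print(prob(A))       # 1/2
print(prob(B))       # 2/3
print(prob(C))       # 5/6
print(prob(S - A))   # 1/2  (complement of A)
print(prob(B & C))   # 1/2
```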

EXAMPLE 4−10

Find the probability of getting a red ace when a card is drawn from an ordinary deck of cards.

SOLUTION

Let R = red ace. Since there are 52 cards and 2 red aces (the ace of hearts and ace of diamonds) in an
ordinary deck of cards, P(R) = 2/52 = 1/26.

EXAMPLE 4−11

If a family has three children, find the probability of the following events:

A. A: All are boys.


B. B: Exactly two are boys.

SOLUTION

Refer to the sample space in Example 4−4. There are 8 outcomes in the sample space.

A. A = {BBB}. Therefore, P (A) = 1/8.


B. B = {BBG, BGB, GBB}. Therefore, P (B) = 3/8.

EXAMPLE 4−12

Two dice are rolled. Find the probability of the following events:
A. E: The sum of faces is equal to 7.
B. F: The sum of faces is greater than 7.
C. G: The sum of faces is 7 or 11.



SOLUTION

Refer to the sample space in Example 4−2. The total number of outcomes is 36.
A. There are 6 outcomes in the sample space whose sum is 7. Therefore, P (E) = 6/36 = 1/6.
B. There are 15 outcomes in the sample space whose sum is greater than 7. Therefore,
P (F) = 15/36 = 5/12.
C. There are 8 outcomes in the sample space whose sum is 7 or 11. Therefore, P (G) = 8/36 = 2/9.
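
The counts of 6, 15 and 8 favourable outcomes can be verified by enumerating the 36 outcomes. A possible sketch in Python follows; the helper `prob` and the variable names are illustrative assumptions:

```python
from fractions import Fraction
from itertools import product

S = list(product(range(1, 7), repeat=2))   # 36 equally likely outcomes

def prob(condition):
    """Fraction of outcomes (d1, d2) in S that satisfy the condition."""
    favourable = [o for o in S if condition(*o)]
    return Fraction(len(favourable), len(S))

p_E = prob(lambda d1, d2: d1 + d2 == 7)         # sum equal to 7
p_F = prob(lambda d1, d2: d1 + d2 > 7)          # sum greater than 7
p_G = prob(lambda d1, d2: d1 + d2 in (7, 11))   # sum is 7 or 11

print(p_E, p_F, p_G)   # 1/6 5/12 2/9
```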

4.3.2 Empirical or Relative Frequency Probability


The difference between classical and empirical probability is that empirical probability relies on actually
performing the experiment to determine the probability of an event and the outcomes in the sample space
may not be equally likely.

Formula for Empirical Probability


Given a frequency distribution, the probability of an event E being in a given class is:

$$P(E) = \frac{\text{frequency of the class}}{\text{total frequency}} = \frac{f}{n}.$$

EXAMPLE 4−13

A marble is drawn from a bag containing 3 white, 2 red and 5 blue marbles. What is the probability that
the marble drawn is:
A. green,
B. white,
C. not white, and
D. white or red.

SOLUTION

A. P(green) = 0/10 = 0.
B. P(white) = 3/10.
C. P(not white) = 1 − P(white) = 1 − 3/10 = 7/10.
D. P(white or red) = 5/10.



EXAMPLE 4−14

In a sample of 50 people, 21 had type O blood, 22 had type A blood, 5 had type B blood, and 2 had type AB
blood. Construct a frequency distribution and find the probability that:
A. A person has type A blood.
B. A person has type A or type B blood.
C. A person has neither type A nor type O blood.
D. A person does not have type O blood.

SOLUTION

The frequency distribution is as follows:

Blood Type     Frequency
A              22
B              5
AB             2
O              21
Total          50

Using the frequency distribution, we have


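A. $P(\text{A}) = \frac{22}{50} = \frac{11}{25}$.
B. $P(\text{A or B}) = \frac{22 + 5}{50} = \frac{27}{50}$.
C. $P(\text{neither A nor O}) = P(\text{B or AB}) = \frac{5 + 2}{50} = \frac{7}{50}$.
D. $P(\text{not O}) = 1 - P(\text{O}) = 1 - \frac{21}{50} = \frac{29}{50}$.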
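
The same empirical probabilities can be computed from the frequency table with a few lines of code. This is only a sketch in Python; the dictionary `frequency` and the helper `prob` are illustrative names:

```python
from fractions import Fraction

frequency = {"A": 22, "B": 5, "AB": 2, "O": 21}
n = sum(frequency.values())   # total frequency, 50

def prob(*blood_types):
    """Empirical probability f / n for one or more (mutually exclusive) classes."""
    return Fraction(sum(frequency[t] for t in blood_types), n)

p_A = prob("A")
p_A_or_B = prob("A", "B")
p_neither_A_nor_O = prob("B", "AB")
p_not_O = 1 - prob("O")

print(p_A, p_A_or_B, p_neither_A_nor_O, p_not_O)   # 11/25 27/50 7/50 29/50
```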

EXAMPLE 4−15

A computer supplies store is concerned that it may be over-stocking printers. The store has tabulated the
number of printers sold weekly for each of the past 80 weeks. The results are summarized in the following
table:

No. of printers sold 0 1 2 3 4


Number of weeks 36 28 12 2 2



The store intends to use this data as a basis for forecasting printer sales in any given week.
A. Assign probabilities to each of the individual outcomes.
B. What approach did you use in determining the probabilities?
C. Find the probability of selling at least 3 printers in any given week.

SOLUTION

A.
No. of printers sold 0 1 2 3 4
Probability 36/80 28/80 12/80 2/80 2/80

B. Empirical
C. P (at least 3 printers) = P (3) + P (4) = 2/80 + 2/80 = 4/80 = 1/20

4.3.3 Subjective Probability


The third type of probability is called subjective probability. Subjective probability uses a probability value
based on an educated guess or estimate, drawing on experience and beliefs. For example, a physician might say
that, on the basis of her diagnosis, there is a 30% chance the patient will need an operation, or a
weather forecast might say there is a 70% probability that it will rain tomorrow.

4.4 The Addition Rules for Probability


Before we discuss the addition rules for probability, it is important that we discuss the mutually exclusive
events.

Mutually Exclusive Events


Two events A and B are said to be mutually exclusive if they have no outcomes in common, that is,
A ∩ B = {} = Ø.

EXAMPLE 4−16

Determine which events are mutually exclusive and which are not when a single die is rolled.

A. Getting an odd number and getting an even number.


B. Getting a 4 and getting an even number.

SOLUTION

A. The first event has outcomes 1, 3, 5 and the second event has outcomes 2, 4, 6, therefore the
events are mutually exclusive since there is no outcome in common.
B. The first event has outcome 4 and the second event has outcomes 2, 4, 6, therefore the events
are not mutually exclusive since 4 is common in both events.



EXAMPLE 4−17

Determine which events are mutually exclusive and which are not when a single card is drawn from a
deck.
A. Getting a 3 and getting a 6.
B. Getting a 3 and getting a diamond.
C. Getting a red card and getting an ace.

SOLUTION

A. The events are mutually exclusive since there is no card in common.


B. The events are not mutually exclusive since the 3 of diamonds is common to both
events.
C. The events are not mutually exclusive since the two red aces are common in both events.

Addition Rule
If A and B are any two events, then the probability that either event A or event B occurs is
1. P (A or B) = P (A) + P (B), when A and B are mutually exclusive.
2. P (A or B) = P (A) + P (B) − P (A ∩ B), when A and B are not mutually exclusive.

Note:
The above rules can be extended to more than two events.

EXAMPLE 4−18

In a class, there are 20 Fijian, 13 Samoan, and 6 Tongan students. If a student is selected at random,
find the probability that he/she is either a Fijian or Tongan student.

SOLUTION

Let the events, A = Fijian student and B = Tongan Student.

P (A) = 20/39, P (B) = 6 / 39


∴ P (A or B) = P (A) + P (B), since A and B are mutually exclusive.
= 20/39 + 6/39
= 26/39 = 2/3.

EXAMPLE 4−19

A single card is drawn from a deck. Find the probability that it is a spade or an ace.



SOLUTION

Let the events, A = card is spade and B = card is ace.


P (A) = 13/52, P (B) = 4/52, P (A ∩ B) = 1/52 (the ace of spades).
∴ P (A or B) = P (A) + P (B) − P (A ∩ B), since A and B are not mutually exclusive.
= 13/52 + 4/52 − 1/52
= 16/52 = 4/13.
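
The addition rule for events that are not mutually exclusive can be checked against a direct count of the union. A minimal sketch in Python, assuming a standard 52-card deck built as (rank, suit) pairs; the names `deck`, `A`, `B` and `prob` are illustrative:

```python
from fractions import Fraction
from itertools import product

ranks = ["2", "3", "4", "5", "6", "7", "8", "9", "10", "jack", "queen", "king", "ace"]
suits = ["spades", "clubs", "diamonds", "hearts"]
deck = set(product(ranks, suits))   # 52 (rank, suit) pairs

A = {card for card in deck if card[1] == "spades"}   # card is a spade
B = {card for card in deck if card[0] == "ace"}      # card is an ace

def prob(event):
    return Fraction(len(event), len(deck))

# Addition rule: subtract P(A and B) so the ace of spades is not counted twice.
by_rule = prob(A) + prob(B) - prob(A & B)

print(by_rule)                  # 4/13
print(by_rule == prob(A | B))   # True
```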

EXAMPLE 4−20

A McDonald's customer is selected at random. The probability that he has tried a Big Mac is 0.5, that he has
tried a Soft Cone is 0.6, and that he has tried both a Big Mac and a Soft Cone is 0.2. Find the probability that:

A. He tried Big Mac or Soft Cone.


B. He tried only the Soft Cone.
C. He tried neither the Big Mac nor the Soft Cone.
D. He did not try Big Mac.

SOLUTION

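Construct a Venn diagram, where B = tried Big Mac and S = tried Soft Cone.

[Venn diagram: two overlapping circles B and S; B only = 0.3, both B and S = 0.2, S only = 0.4, outside both circles = 0.1.]

Using the Venn diagram, we get:

A. P (B or S) = 0.3 + 0.2 + 0.4 = 0.9
B. P (S only) = 0.6 − 0.2 = 0.4
C. P (neither B nor S) = 1 − 0.9 = 0.1
D. P (not B) = 1 − 0.5 = 0.5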

4.5 Summary
In this chapter, we looked at the basic concepts of probability. It explained terms and concepts
such as probability; experiments; probabilistic and non-probabilistic experiments; sample space;
outcome; tree diagram and Venn diagram; event; simple, compound, null and sure events; and the complement,
intersection and union of events. Later, it discussed the three interpretations of probability (classical,
empirical and subjective) and the addition rules of probability.



EXERCISES

1. A coin is tossed; if it falls head up, it is tossed again. If it falls tail up, a die is rolled. Draw a tree
diagram and determine all possible outcomes.

2. Probability can be classified into three basic approaches or interpretations.


A. List the three approaches.
B. In an experiment of tossing a coin 10 times, only 2 heads appeared, hence the probability of
getting a head is 0.2. Which approach is used here? Explain briefly.

3. Classify the events below as simple or compound. Explain your choice.


A. Getting a head in tossing a coin.
B. Getting an even number when rolling a die.

4. In USP, the probability that a student takes calculus or is on scholarship is 0.85. The probability that
a student is on scholarship is 0.61 and the probability that a student is taking calculus is 0.31.
A. Are events C: student takes calculus, and S: student is on scholarship mutually exclusive events?
Explain.
B. If a student is randomly chosen, find the probability that the student is taking calculus and is on
scholarship.
C. If a student is randomly chosen, find the probability that the student is neither taking calculus nor
is on scholarship.
5. For a card drawn from an ordinary deck, find the probability of getting a:
A. Queen
B. 3 and a diamond
C. 3 or a diamond
D. 3 or a 6

6. In a hospital unit, there are 8 nurses and 5 doctors; 7 nurses and 3 doctors are females. If a staff member is
selected at random, find the probability that the staff member:
A. Is a female
B. Is a nurse and a female
C. Is a nurse or a female

7. Tom and Jerry roll two dice 50 times and record the sum of the two dice for each roll in the table below.

Sum of the rolls of two dice


3 5 5 4 6 7 7 5 9 10
12 9 6 5 7 8 7 4 11 6
8 8 10 6 7 4 4 5 7 9
9 7 8 11 6 5 4 7 7 4
3 6 7 7 7 8 6 7 8 9

A. What is their empirical probability of rolling a 7?


B. What is the classical probability of rolling a 7?
C. How do the empirical and classical (theoretical) probabilities compare?



CHAPTER 5:

PROBABILITY (PART II)



Overview
This chapter explains the more sophisticated concepts in probability such as independent events;
conditional probability; probability and counting rules. The chapter concludes with a summary and a set
of exercises.

Objectives
After completing this chapter, you should be able to:
1. Find the probability of compound events using multiplication rules.
2. Find the conditional probability of an event.
3. Utilize the fundamental counting rule, permutation and combination.
4. Find the probability of an event using the counting rules.

5.1 Introduction
In the previous chapter, we have looked at some basic concepts of probability. Further, we also explained
the three basic interpretations of probability, the concepts of mutually exclusive events and the addition
rules.

The purpose of this chapter is to look at some further concepts of probability, such as independent events,
dependent events, conditional probability and counting rules.

5.2 Independent Events


In this section, we explain what it means for two events to be independent. For example, if
you toss a coin and then roll a die, the events "getting a head on the coin" and "getting a 6 on the die" are
independent, because the probability of getting a 6 on the die is not affected by
getting a head on the coin. A more formal definition of independent events is as follows:

Two events A and B are independent events if the fact that A occurs does not affect the
probability of B occurring.

EXAMPLE 5−1

Here are other examples of independent events:


 Having a large shoe size and having a high IQ.
 Rolling a 4 on a single 6-sided die, and then rolling a 1 on a second roll of the die.
 Drawing a queen from an ordinary deck of cards, replacing it, and then drawing an ace.

To test for independence of two events, we can use the following rule:

Two events A and B are said to be independent if and only if

P (A ∩ B) = P (A) × P (B).



EXAMPLE 5−2

A coin is flipped and a die is rolled. Find the probability of getting a head on the coin and a 6 on the die.

SOLUTION

Let A = getting a head on the coin and B = getting a 6 on the die. The events A and B are independent,
therefore

P (A ∩ B) = P (A) × P (B)
= 1 /2 × 1 / 6 = 1 /12

EXAMPLE 5−3

In a group of 60 students, 20 study History, 24 study French and 8 study both History and French. Are
the events a student studies History and a student studies French independent?

SOLUTION

From the information given:

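$P(\text{History}) = \frac{20}{60} = \frac{1}{3}$, $P(\text{French}) = \frac{24}{60} = \frac{2}{5}$, $P(\text{History and French}) = \frac{8}{60} = \frac{2}{15}$.

Now,

$P(\text{History}) \times P(\text{French}) = \frac{1}{3} \times \frac{2}{5} = \frac{2}{15}$.

∴ P (History and French) = P (History) × P (French).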

Hence, the two events are independent.
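
The independence test P(A ∩ B) = P(A) × P(B) amounts to one comparison. As an illustration in Python (the variable names are assumptions; `Fraction` avoids rounding error in the comparison):

```python
from fractions import Fraction

total = 60
p_history = Fraction(20, total)
p_french = Fraction(24, total)
p_both = Fraction(8, total)

# The events are independent if and only if P(A and B) == P(A) * P(B).
independent = (p_both == p_history * p_french)

print(p_history * p_french, p_both, independent)   # 2/15 2/15 True
```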

EXAMPLE 5−4

At USP, 74.3% of the incoming first-year students have computers. If 2 students are selected at random,
find the probability that:

A. None have computers
B. Exactly one has a computer
C. At least one has a computer

SOLUTION

Let C = the student has a computer and N = the student does not have a computer.
The tree diagram for this problem is as follows:



1st student    2nd student    Outcome
C (0.743)      C (0.743)      CC
               N (0.257)      CN
N (0.257)      C (0.743)      NC
               N (0.257)      NN

Here the events are independent, so using the tree diagram,

A. P (None have computers) = P (NN) = 0.257 × 0.257 = 0.066.
B. P (Exactly one has a computer) = P (CN) + P (NC)
= 0.743 × 0.257 + 0.257 × 0.743
= 0.382.
C. P (At least one has a computer) = 1 − P (None have computers)
= 1 − 0.066
= 0.934.

Note: The above rule for independent events can be extended to more than two events. That is, if A, B and C
are independent events, then P (A ∩ B ∩ C) = P (A) × P (B) × P (C).

5.3 Conditional Probability and Dependent Events


5.3.1 Conditional Probability
The probability that the second event B occurs given that the first event A has already occurred is
called a conditional probability and is written as P (B | A). P (B | A) can be found by dividing the
probability of both events occurring by the probability of the first event A, that is,

$$P(B \mid A) = \frac{P(A \cap B)}{P(A)}, \quad \text{provided } P(A) \neq 0.$$

EXAMPLE 5−5

In a certain city, the probability that an automobile will be stolen and found within one week is 0.0009.
The probability that an automobile will be stolen is 0.0015. Find the probability that a stolen automobile
will be found within one week.

SOLUTION

Let the events be A: the automobile is stolen, and B: the automobile is found within one week.

Then P (A ∩ B) = 0.0009 and P (A) = 0.0015.



Therefore,

$$P(B \mid A) = \frac{P(A \cap B)}{P(A)} = \frac{0.0009}{0.0015} = 0.6.$$

EXAMPLE 5−6

A random sample of 200 adults is classified below according to gender and level of education attained.

Education      Gender               Total
               Male      Female
Elementary     38        45         83
Secondary      28        50         78
College        22        17         39
Total          88        112        200

If a person is picked at random from this group, find the probability that:
A. The person is male, given that the person has secondary education.
B. The person does not have a college degree, given that the person is a female.

SOLUTION
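A. $P(\text{male} \mid \text{secondary}) = \frac{P(\text{male} \cap \text{secondary})}{P(\text{secondary})} = \frac{28/200}{78/200} = \frac{28}{78} \approx 0.36$

B. $P(\text{no college degree} \mid \text{female}) = \frac{P(\text{no college degree} \cap \text{female})}{P(\text{female})} = \frac{(45 + 50)/200}{112/200} = \frac{95}{112} \approx 0.85$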
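
Because the common denominator of 200 cancels, a conditional probability from a two-way table reduces to a ratio of counts within the given row or column. The following Python sketch illustrates this; the nested dictionary `table` and the other names are illustrative:

```python
from fractions import Fraction

# Counts from the two-way table: table[education][gender]
table = {
    "Elementary": {"Male": 38, "Female": 45},
    "Secondary":  {"Male": 28, "Female": 50},
    "College":    {"Male": 22, "Female": 17},
}

# A. P(male | secondary) = n(male and secondary) / n(secondary)
n_secondary = sum(table["Secondary"].values())
p_male_given_secondary = Fraction(table["Secondary"]["Male"], n_secondary)

# B. P(no college | female) = n(no college and female) / n(female)
n_female = sum(row["Female"] for row in table.values())
n_female_no_college = n_female - table["College"]["Female"]
p_no_college_given_female = Fraction(n_female_no_college, n_female)

print(round(float(p_male_given_secondary), 2))      # 0.36
print(round(float(p_no_college_given_female), 2))   # 0.85
```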



5.3.2 Dependent Events
Two events A and B are dependent events if the fact that A occurs does affect the probability of B
occurring.

EXAMPLE 5−7

Here are other examples of dependent events:

A. Getting high grades and getting a scholarship.


B. Getting a rise in the salary and buying a new car.
C. Drawing a queen from an ordinary deck of cards, not replacing it, and then drawing an ace.

For two dependent events A and B, the probability that both occur is given by the multiplication rule:


P (A ∩ B) = P (A) × P (B | A).

EXAMPLE 5−8

A company estimates that 30% of the country has seen its commercial and that if a person sees its
commercial, there is 20% probability that the person will buy its products. What is the probability that a
person chosen at random in the country has seen the commercial and bought the product?

SOLUTION

Let A = the person sees the commercial and B = the person buys the product. Therefore,

P (A and B) = P (A) × P (B | A)
= 0.3 × 0.2 = 0.06.

EXAMPLE 5−9

A flashlight has 6 batteries, 2 of which are defective. If 2 are selected at random without replacement,
find the probability that:
A. Both are defective.
B. None are defective.
C. At least one is defective.

SOLUTION

Let D = the battery is defective and G = the battery is good. The tree diagram for this problem is



1st battery    2nd battery    Outcome
D (2/6)        D (1/5)        DD
               G (4/5)        DG
G (4/6)        D (2/5)        GD
               G (3/5)        GG

Note: The second branch has conditional probabilities, that is 1/5 is the probability that the second battery
is defective given that the first battery was defective. Similarly, 3/5 is the probability that the second
battery is good given that the first battery was good.
Using the tree diagram,

A. P (Both are defective) = P (DD) = 2 / 6 × 1 / 5 = 1 /15.


B. P (None are defective) = P (GG) = 4 / 6 × 3 / 5 = 2 /5.

C. P (At least one is defective) = 1 − P (None are defective)
= 1 − 2/5
= 3/5.

EXAMPLE 5−10

Three cards are drawn from an ordinary deck without replacement. Find the probability of:

A. Getting 3 jacks.
B. Getting an ace, a king, and a queen in order.
C. At least one jack.

SOLUTION
A. $P(\text{3 jacks}) = \frac{4}{52} \times \frac{3}{51} \times \frac{2}{50} = \frac{1}{5525}$.

B. $P(\text{an ace, a king and then a queen}) = \frac{4}{52} \times \frac{4}{51} \times \frac{4}{50} = \frac{8}{16{,}575}$.

C. $P(\text{at least one jack}) = 1 - P(\text{none are jacks}) = 1 - \frac{48}{52} \times \frac{47}{51} \times \frac{46}{50} = \frac{1201}{5525}$.

5.4 Counting Rules


In this section, we discuss how to find the number of outcomes in an event or in a sample space using
the following counting rules:
 Fundamental counting rule,
 Permutation rule, and
 Combination rule

5.4.1 Fundamental Counting Rule


In a sequence of n events in which the first one has $k_1$ possibilities, the second has $k_2$ possibilities,
and so on up to the nth one, which has $k_n$ possibilities, the total number of possibilities of the sequence is
$k_1 \times k_2 \times \cdots \times k_n$.

EXAMPLE 5−11

If a coin is tossed:
 two times, then the total number of outcomes = 2 × 2 = $2^2$ = 4.
 three times, then the total number of outcomes = 2 × 2 × 2 = $2^3$ = 8.
 r times, then the total number of outcomes = $2^r$.

EXAMPLE 5−12

If a die is rolled:
 two times, then the total number of outcomes = 6 × 6 = $6^2$ = 36.
 r times, then the total number of outcomes = $6^r$.

EXAMPLE 5−13

How many different license plate numbers can be made using two letters followed by three digits, if letters
and digits may be repeated?

SOLUTION

Since there are 26 letters (A, B, C, …, X, Y, Z) and 10 digits (0, 1, 2, …, 9) that can be used to form a
license plate number, the total number of possible license plate numbers is

26 × 26 × 10 × 10 × 10 = 676000.



EXAMPLE 5−14

The chairs in a room are to be labelled with a vowel and a positive integer not exceeding 99. What
is the largest number of chairs that can be labelled differently?

SOLUTION
Since there are 5 vowels (A, E, I, O, U) and 99 positive integers not exceeding 99 (1, 2, …, 99) that can be used to
label a chair, the largest number of chairs that can be labelled differently is

5 × 99 = 495.
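
The fundamental counting rule is simply a product of the number of possibilities at each step. A brief Python sketch of the two examples above (the variable names are illustrative):

```python
from math import prod

# Two letters followed by three digits, repetition allowed.
license_plates = prod([26, 26, 10, 10, 10])

# One vowel followed by one integer from 1 to 99.
chair_labels = prod([5, 99])

print(license_plates)   # 676000
print(chair_labels)     # 495
```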
Factorial Notation
Before discussing permutations, we introduce a useful shorthand notation: the factorial symbol. The
symbol n!, read as “n factorial,” is defined as:
n! = n(n − 1)(n − 2) × … × 3 × 2 × 1
where, by definition,
 0! = 1
 1! = 1.

For example, 5! = 5 × 4 × 3 × 2 × 1 = 120 (factorials can be computed directly using a
calculator).

5.4.2 Permutation Rule


A permutation is an arrangement of distinct objects in a specific order. The number of permutations of r
objects arranged from n distinct objects is defined as:
$$P(n, r) = {}_{n}P_{r} = \frac{n!}{(n - r)!}$$

EXAMPLE 5−15

The letters a, b and c can be arranged in six different ways, that is

abc acb bac bca cab cba

This can be computed using P (3, 3) = 6 ways.

Note: P (n, n) = n!

EXAMPLE 5−16

How many ways are there to select a first-prize winner, a second-prize winner and a third-prize winner
from 50 different students who have entered a mathematics contest?

SOLUTION P (50, 3) = 117600.



EXAMPLE 5−17

How many different ways can a chairperson and an assistant chairperson be selected for a research
project if there are seven scientists available?

SOLUTION P (7, 2) = 42.

EXAMPLE 5−18

How many 3-digit numbers can be formed from the digits 1, 2, 3, 4, 5, 6, 7 if no digit is repeated?

SOLUTION P (7, 3) = 210.
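
The permutation counts in Examples 5−15 to 5−18 can be reproduced from the formula P(n, r) = n!/(n − r)!. The sketch below, in Python, defines a hypothetical helper `permutations_count` and compares it with the standard-library function `math.perm` (available from Python 3.8):

```python
from math import factorial, perm

def permutations_count(n, r):
    """P(n, r) = n! / (n - r)!"""
    return factorial(n) // factorial(n - r)

print(permutations_count(3, 3), perm(3, 3))     # 6 6
print(permutations_count(50, 3), perm(50, 3))   # 117600 117600
print(perm(7, 2), perm(7, 3))                   # 42 210
```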

Permutation Rule (Objects not Distinct)


The number of distinct permutations of n things of which $n_1$ are of one kind, $n_2$ of a second kind, ..., $n_k$
of the kth kind is

$$\frac{n!}{n_1!\, n_2! \cdots n_k!}.$$

EXAMPLE 5−19

In how many distinct ways can the letters in the word "STATISTICS" be arranged?

SOLUTION

Since there are 10 letters, of which 3 are S's, 3 are T's and 2 are I's, the number of distinct ways the letters
can be arranged is

$$\frac{10!}{3!\,3!\,2!} = 50{,}400.$$

EXAMPLE 5−20

How many different vertical arrangements are possible for 10 flags if 2 are white, 3 are red and 5 are
blue?

SOLUTION

Since there are 2 white, 3 red and 5 blue flags, the number of different vertical arrangements possible is

$$\frac{10!}{2!\,3!\,5!} = 2520.$$
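
The rule for objects that are not all distinct can be sketched in a few lines of Python. The function name `distinct_arrangements` is hypothetical; it counts how often each item occurs and divides n! by the factorial of each count:

```python
from collections import Counter
from math import factorial

def distinct_arrangements(items):
    """n! / (n1! n2! ... nk!) for a sequence with repeated items."""
    result = factorial(len(items))
    for count in Counter(items).values():
        result //= factorial(count)
    return result

word = distinct_arrangements("STATISTICS")
flags = distinct_arrangements("W" * 2 + "R" * 3 + "B" * 5)

print(word)    # 50400
print(flags)   # 2520
```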



5.4.3 Combination Rule
A selection of distinct objects without regard to order is called a combination. The number of combinations
of r objects selected from n distinct objects is defined as

$$C(n, r) = {}_{n}C_{r} = \frac{n!}{(n - r)!\, r!}.$$

EXAMPLE 5−21

How many ways are there to select six players from a 15-member volleyball team for a challenge match
against another department?

SOLUTION C (15, 6) = 5005.

EXAMPLE 5−22

How many different ways can a lecturer select two textbooks from a possible of 17?

SOLUTION C (17, 2) = 136.

EXAMPLE 5−23

There are 7 women and 5 men in a department. A committee of 4 is to be formed.


A. How many ways can a committee of 4 be selected?
B. How many ways can this committee be selected if there must be 2 men and 2 women?
C. How many ways can this committee be selected if there must be at least 2 women on the
committee?

SOLUTION

A. Since there are 12 people and 4 are to be selected for the committee, there are C (12, 4) =
495 ways.
B. There are C (5, 2) ways of choosing 2 men and C (7, 2) ways of choosing 2 women, hence there are C (7,
2) × C (5, 2) = 210 ways.
C. At least 2 women on the committee means 2 women or 3 women or 4 women on the committee.
There are C (7, 2) × C (5, 2) = 210 ways to have 2 women, C (7, 3) × C (5, 1) = 175 ways to have
3 women, and C (7, 4) × C (5, 0) = 35 ways to have 4 women, hence there are 210 + 175 + 35 = 420
ways.
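
The committee counts can be verified with the standard-library function `math.comb` (Python 3.8 or later). This is only an illustrative sketch, and the variable names are assumptions:

```python
from math import comb

women, men, size = 7, 5, 4

total = comb(women + men, size)            # any 4 of the 12 people
two_each = comb(women, 2) * comb(men, 2)   # exactly 2 women and 2 men
at_least_two_women = sum(
    comb(women, w) * comb(men, size - w) for w in range(2, size + 1)
)

print(total, two_each, at_least_two_women)   # 495 210 420
```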



5.5 Probability and Counting Rules
The three counting rules that we have learnt in the previous section can be combined with probability
rules to solve many types of probability problems.

EXAMPLE 5−24

Find the probability that if 10 different-sized books are arranged in a row, they will be arranged in order
of size.

SOLUTION

There are 10! ways of arranging 10 books in a row, so n(S) = 10! = 3628800.
Let E = the 10 books are arranged in order of size.
Since there are 2 ways of arranging the 10 books in order of size (ascending or descending), n(E) = 2. Therefore,

$$P(E) = \frac{n(E)}{n(S)} = \frac{2}{3628800} = \frac{1}{1814400}.$$

EXAMPLE 5−25

Five cards are drawn from a pack of 52 cards. What is the probability that:
A. All are spades,
B. 2 are hearts and 3 are diamonds, and
C. All are black.

SOLUTION

A pack of cards contains 52 cards out of which 13 are spades, 13 are hearts, 13 are diamonds and 13
are clubs. If 5 cards are drawn, then:
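A. $P(\text{all are spades}) = \frac{{}_{13}C_{5}}{{}_{52}C_{5}} = \frac{1287}{2598960} = \frac{33}{66640} \approx 0.0005$.

B. $P(\text{2 are hearts and 3 are diamonds}) = \frac{{}_{13}C_{2} \times {}_{13}C_{3}}{{}_{52}C_{5}} = \frac{78 \times 286}{2598960} \approx 0.0086$.

C. Out of 52 cards, 26 are black and 26 are red. Therefore,

$P(\text{all are black}) = \frac{{}_{26}C_{5}}{{}_{52}C_{5}} = \frac{65780}{2598960} \approx 0.0253$.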
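
These three probabilities combine the combination rule with the classical formula n(E)/n(S). A minimal Python sketch, with illustrative variable names, reproduces the rounded values:

```python
from math import comb

hands = comb(52, 5)   # 2598960 possible five-card hands

p_all_spades = comb(13, 5) / hands
p_2_hearts_3_diamonds = comb(13, 2) * comb(13, 3) / hands
p_all_black = comb(26, 5) / hands

print(round(p_all_spades, 4))            # 0.0005
print(round(p_2_hearts_3_diamonds, 4))   # 0.0086
print(round(p_all_black, 4))             # 0.0253
```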



EXAMPLE 5−26

A fair coin is tossed 5 times. Find the probability that:


A. All are heads
B. Exactly 2 heads appear.
C. At least 4 heads appear.

SOLUTION

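There are $2^5 = 32$ outcomes in total when a coin is tossed 5 times.

A. $P(\text{getting all heads}) = \frac{{}_{5}C_{5}}{32} = \frac{1}{32}$.

B. $P(\text{getting exactly 2 heads}) = \frac{{}_{5}C_{2}}{32} = \frac{10}{32} = \frac{5}{16}$.

C. $P(\text{at least 4 heads}) = P(\text{4 heads}) + P(\text{5 heads}) = \frac{{}_{5}C_{4}}{32} + \frac{{}_{5}C_{5}}{32} = \frac{5}{32} + \frac{1}{32} = \frac{6}{32} = \frac{3}{16}$.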

EXAMPLE 5−27

What is the probability that a four-digit telephone extension has one or more repeated digits?

SOLUTION

There are $10^4$ possible 4-digit telephone extensions, so n(S) = 10000.

Let E = one or more digits are repeated; then Ē = none of the digits are repeated.
$n(\bar{E}) = {}_{10}P_{4} = 5040$, so $P(\bar{E}) = {}_{10}P_{4} / 10^4 = 0.504$,
hence $P(E) = 1 - 0.504 = 0.496$.
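
The complement argument can be checked numerically. In this Python sketch (variable names are illustrative), `math.perm` counts the extensions with no repeated digit:

```python
from math import perm

digits, length = 10, 4
n_S = digits ** length               # 10000 possible extensions
n_no_repeat = perm(digits, length)   # 10 * 9 * 8 * 7 = 5040

p_no_repeat = n_no_repeat / n_S
p_repeat = 1 - p_no_repeat

print(n_no_repeat)          # 5040
print(round(p_repeat, 3))   # 0.496
```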

5.6 Summary
In this chapter, we discussed more advanced concepts in probability, such as independent and
dependent events and conditional probability. Later, we also discussed counting rules (the
fundamental counting rule, permutations and combinations) and used them to solve some probability problems.



EXERCISES

1. In a scientific study, there are 8 guinea pigs, 5 of which are pregnant. If 3 are to be selected at random
to be used in the experiment, find the probability that:
A. All three are pregnant.
B. Exactly 2 are pregnant.
C. At least one is pregnant.

2. Approximately 10% of the students at USP own a car. If 3 students are selected at random, find the
probability that:
A. All of them own a car.
B. Exactly 2 own a car.
C. At least one owns a car.

3. The following table gives the two-way classification of 400 students based on gender and whether or
not they work while being full-time students.

Work Do not work


Male 120 60
Female 130 90

A. A student is randomly selected from this group of 400 students. What is the probability that this
student:
i. works
ii. works or is male
iii. is female and does not work
iv. does not work, given that the student is male

B. Are the events “male” and “do not work” mutually exclusive events? Explain why or why not.
C. Are the events “female” and “do not work” independent? Explain why or why not.

4. Urn 1 contains 5 red marbles and 3 black marbles. Urn 2 contains 3 red marbles and 1 black marble.
If an urn is selected at random and a marble is drawn, find the probability it will be black.

5. Two cards are drawn at random (without replacement) from a regular deck of 52 cards.
A. What is the probability that the first card is red and the second card is a heart?
B. What is the probability that the first card is a heart and the second card is red?

6. There are 2 roads between town A and B. There are 4 roads between town B and C. How many
different routes may one travel from town A to town C through town B?

7. A student wants to arrange the letters in the word SUNDAY.


A. How many different ways are there to arrange the letters in the word SUNDAY?
B. If we insist that the letter S come first, how many ways are there?
C. If we insist that the letter S come first and the letter Y be last, how many ways are there?



8. A group of 9 people is going to be formed into committees of 4, 3, and 2 people. How many
committees can be formed if:
A. A person can serve on any number of committees?
B. No person can serve on more than one committee?

9. A committee of 5 people is to be formed from 6 doctors and 9 dentists. Find the probability that the
committee will consist of:

A. All dentists
B. 2 dentists and 3 doctors

10. What is the probability that a seven-digit phone number contains the number 7?



CHAPTER 6:

DISCRETE PROBABILITY
DISTRIBUTIONS

Chapter 6: Discrete Probability Distributions 91


Overview
This chapter explains the concept of a discrete probability distribution. The concepts discussed in this
chapter are as follows: random variables; discrete probability distributions; the mean and variance of a
discrete probability distribution; and the binomial distribution. The chapter concludes with a summary
and a set of exercises.

Objectives
After completing this chapter, you should be able to:
1. Construct a probability distribution for a discrete random variable.
2. Find the mean, variance, standard deviation and expected value of a discrete random variable.
3. Find probabilities using the binomial distribution.
4. Find the mean, variance and standard deviation of a binomial random variable.

6.1 Introduction
In the last chapter, we discussed the concepts and rules of probability. This chapter extends the concept
of probability to probability distributions. We have seen that a random experiment has more than
one outcome and that it is impossible to predict which of the possible outcomes will occur when the
experiment is performed. In this chapter, we will see that if the outcomes of a random experiment and
their probabilities are known, we can find out what will happen, on average, if the experiment
is performed many times.

This chapter explains random variables and their types. Then the probability distribution of a discrete
random variable, together with its mean and variance, is discussed. In addition, a special
probability distribution called the binomial distribution is explained.

6.2 Random Variables


Before a probability distribution is defined, we should review the definition of a variable. Recall that, in
Chapter 1, a variable was defined as a characteristic that can assume
different values. Variables are represented by letters such as X, Y or Z. Since the variables in this chapter
are associated with probability, they are called random variables.

A random variable is a variable whose value is determined by the outcome of a random experiment.

EXAMPLE 6−1

If two coins are tossed, then the sample space is

S = {HH, HT, TH, TT}.

Let the random variable X be the number of heads. If we count the number of heads in each outcome of
the sample space, we have

Outcome   HH   HT   TH   TT
X         2    1    1    0



Then the values the random variable X can assume are 0, 1 or 2.

Types of Random Variables


A random variable is either a discrete random variable or a continuous random variable.

i. Discrete Random Variables

Variables whose values are countable are called discrete variables. Examples are the number
of students in a class and the number of road accidents.

ii. Continuous Random Variables

Variables that can assume all values in an interval are called continuous variables. Examples are the
weight of a student in a class and the price of a car.

6.3 Discrete Probability Distribution


A discrete probability distribution consists of the values a discrete random variable can assume and their
corresponding probabilities.

EXAMPLE 6−2

In an experiment of rolling a single die, write the probability distribution of the number of dots.

SOLUTION
Let X be the number of dots on the die, and then the values X can assume are 1, 2, 3, 4, 5, 6. The
probability of each outcome is 1/6. Then the probability distribution of X is given by:

X 1 2 3 4 5 6

P (X) 1/6 1/6 1/6 1/6 1/6 1/6

EXAMPLE 6−3

In an experiment of tossing a coin 3 times, write the probability distribution of the number of heads.

SOLUTION

The sample space is S = {HHH, HHT, HTH, THH, HTT, THT, TTH, TTT}.
Let X be the number of heads, then the values X can assume are 0, 1, 2, and 3. Then the probability
distribution of X is given by

X       0     1     2     3

P (X)   1/8   3/8   3/8   1/8



This probability distribution can be presented as a bar graph in which each bar corresponds to a value
of X and the height of the bar gives the probability P (X) of that value.

EXAMPLE 6−4

In an experiment of rolling two dice, find the probability distribution of a random variable X that represents
the sum of outcomes.

SOLUTION

Refer to Example 4−2 of Chapter 4 for the sample space. When we sum the outcomes, the minimum
sum we get is 2 and the maximum we can get is 12. The values X can assume are 2, 3, 4, 5, 6, 7, 8, 9,
10, 11, 12.

To find the probability corresponding to 2, we find the outcomes whose sum is 2. There is only one
such outcome, (1, 1). Since there are 36 outcomes altogether, P(2) = 1/36.

To find the probability corresponding to 3, we find the outcomes whose sum is 3. There are two such
outcomes, (1, 2) and (2, 1), so P(3) = 2/36, and so on.

The probability distribution of X is given by:

X       2      3      4      5      6      7      8      9      10     11     12

P (X)   1/36   2/36   3/36   4/36   5/36   6/36   5/36   4/36   3/36   2/36   1/36
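
The counting used in Examples 6−3 and 6−4 can also be carried out by listing the sample space and
tallying the value of X for each outcome. The short Python sketch below is only an illustration of that
idea; the function name `sum_distribution` is our own and does not come from the text.

```python
from collections import Counter
from fractions import Fraction
from itertools import product


def sum_distribution(num_dice=2, faces=6):
    """Probability distribution of the sum of fair dice (illustrative helper)."""
    # Every ordered outcome, e.g. (1, 1), (1, 2), ..., (6, 6), is equally likely.
    outcomes = list(product(range(1, faces + 1), repeat=num_dice))
    counts = Counter(sum(outcome) for outcome in outcomes)
    return {x: Fraction(counts[x], len(outcomes)) for x in sorted(counts)}


dice = sum_distribution()
for x, prob in dice.items():
    print(x, prob)  # Fraction reduces automatically, so 2/36 prints as 1/18
```

Exact fractions are used so that the result can be compared directly with the table above.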



EXAMPLE 6−5

Two balls are drawn in succession without replacement from an urn containing 4 red balls and 3 black
balls. Find the probability distribution of the random variable X, the number of black balls drawn.

SOLUTION

Selecting 2 balls from 7 can be done in ${}^{7}C_{2} = 21$ ways. Hence, S contains 21 sample points.
Here, X = the number of black balls = 0, 1, 2.
The probability of selecting 0 black balls (i.e. X = 0) is

$$P(X = 0) = \frac{{}^{3}C_{0} \times {}^{4}C_{2}}{{}^{7}C_{2}} = \frac{6}{21} = \frac{2}{7}.$$

Similarly,

$$P(X = 1) = \frac{{}^{3}C_{1} \times {}^{4}C_{1}}{{}^{7}C_{2}} = \frac{12}{21} = \frac{4}{7},$$

$$P(X = 2) = \frac{{}^{3}C_{2} \times {}^{4}C_{0}}{{}^{7}C_{2}} = \frac{3}{21} = \frac{1}{7}.$$

The probability distribution of X is given by:

X       0     1     2

P (X)   2/7   4/7   1/7
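
The combination counts in Example 6−5 can be checked with Python's `math.comb`. This is a minimal
sketch under the assumptions of the example (4 red balls, 3 black balls, 2 draws without replacement);
the helper name `black_ball_distribution` is hypothetical.

```python
from fractions import Fraction
from math import comb


def black_ball_distribution(red=4, black=3, draws=2):
    """P(X = x) for X = number of black balls drawn without replacement."""
    total_ways = comb(red + black, draws)  # 7C2 = 21 for the urn in the example
    return {
        # Choose x of the black balls and the remaining draws from the red balls.
        x: Fraction(comb(black, x) * comb(red, draws - x), total_ways)
        for x in range(min(black, draws) + 1)
    }


urn = black_ball_distribution()
print(urn)
```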

Requirements of a Discrete Probability Distribution

1. The probability of each event in the sample space must be between 0 and 1, inclusive. That is,
$0 \le P(X) \le 1$.
2. The sum of the probabilities of all the events in the sample space must equal 1; that is,
$\sum P(X) = 1$.

EXAMPLE 6−6

Determine whether the following is a probability distribution.


A.
X −1 0 1 2
P( X ) 0.25 0.34 0.28 0.13

B.
X 0 1 2 3
P( X ) 0.08 0.11 0.39 0.27



C.
X 0 2 4 6
P( X ) −1 1.5 0.3 0.2

SOLUTION

A. It is a probability distribution because it satisfies both requirements: every probability is between 0
and 1, and 0.25 + 0.34 + 0.28 + 0.13 = 1.
B. It is not a probability distribution because the probabilities sum to 0.85, not 1.
C. It is not a probability distribution because P(0) = −1 and P(2) = 1.5 are not between 0 and 1.
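
The two requirements can be tested mechanically. The sketch below is an illustration only: the function
name `is_probability_distribution` and the small tolerance used to absorb floating-point rounding are
our own assumptions, not part of the text.

```python
from math import isclose


def is_probability_distribution(probabilities, tol=1e-9):
    """Check both requirements of a discrete probability distribution."""
    in_range = all(0 <= p <= 1 for p in probabilities)        # requirement 1
    sums_to_one = isclose(sum(probabilities), 1.0, abs_tol=tol)  # requirement 2
    return in_range and sums_to_one


table_a = [0.25, 0.34, 0.28, 0.13]
table_b = [0.08, 0.11, 0.39, 0.27]
table_c = [-1, 1.5, 0.3, 0.2]

for name, table in [("A", table_a), ("B", table_b), ("C", table_c)]:
    print(name, is_probability_distribution(table))
```

Note that table C sums to 1 but still fails, because the first requirement is checked separately.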

6.4 Mean, Variance and Standard Deviation of Discrete Distribution


Now you will learn how to compute the mean, variance, and standard deviation of a discrete probability
distribution.

6.4.1 The Mean


The mean (expected value) of a discrete random variable X, denoted by $\mu$ or $E(X)$, is given by
$$\mu = E(X) = \sum X \cdot P(X).$$
Note:
1. $E(X)$ is read as the expected value of the random variable X.
2. $\sum X \cdot P(X)$ means to multiply each value of the random variable by its corresponding
probability and then add the results.

6.4.2 The Variance and Standard Deviation


The variance of a discrete random variable X, denoted by $\sigma^2$, is given by:

$$\sigma^2 = \sum \left[ X^2 \cdot P(X) \right] - \mu^2.$$

Note:
1. $\sum X^2 \cdot P(X)$ means to multiply the square of each value of the random variable by its
corresponding probability, and then add the results.
2. The standard deviation $\sigma$ is found by taking the square root of the variance.
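
The formulas for the mean and the variance can be applied step by step in a few lines of Python. The
sketch below uses the coin-toss distribution of Example 6−3 as input; the function name
`mean_and_variance` is our own choice.

```python
from fractions import Fraction
from math import sqrt


def mean_and_variance(distribution):
    """Return (mean, variance) of a discrete distribution given as {x: P(x)}."""
    mean = sum(x * p for x, p in distribution.items())
    # Shortcut formula: sum of X^2 * P(X), minus the square of the mean.
    variance = sum(x**2 * p for x, p in distribution.items()) - mean**2
    return mean, variance


heads = {0: Fraction(1, 8), 1: Fraction(3, 8), 2: Fraction(3, 8), 3: Fraction(1, 8)}
mu, var = mean_and_variance(heads)
sigma = sqrt(var)
print(mu, var, round(sigma, 3))
```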

EXAMPLE 6−7

Find the mean, variance and standard deviation of the probability distribution in Example 6–3.

SOLUTION

The probability distribution is:

X        0     1     2     3
P( X )   1/8   3/8   3/8   1/8

$$\mu = \sum X \cdot P(X) = 0(1/8) + 1(3/8) + 2(3/8) + 3(1/8) = 1.5$$

$$\sigma^2 = \sum \left[ X^2 \cdot P(X) \right] - \mu^2 = 0^2(1/8) + 1^2(3/8) + 2^2(3/8) + 3^2(1/8) - 1.5^2 = 0.75$$

$$\sigma = \sqrt{0.75} \approx 0.866$$

EXAMPLE 6−8

In a gambling game, a man is paid $5 if he gets all heads or all tails when 3 coins are tossed but he has
to pay out $3 if either 1 or 2 heads show up. What is his expected gain?

SOLUTION

The sample space is given by S = {HHH, HHT, HTH, THH, HTT, THT, TTH, TTT}. Let X = the gain in the
game. Then the probability distribution of X is given by:

X        $5     −$3
P( X )   2/8    6/8

Thus, the expected gain is

$$E(X) = \sum X \cdot P(X) = 5 \cdot \frac{1}{4} + (-3) \cdot \frac{3}{4} = \frac{5}{4} - \frac{9}{4} = -1.$$

Hence, the gambler loses $1, on average, in each play of the game.



EXAMPLE 6−9

One thousand tickets are sold at $1 each for a color television valued at $350. What is the expected value
of the gain if a person purchases one ticket?

SOLUTION

Let X = the gain in the game. Then the probability distribution of X is given by:

X        $349      −$1
P( X )   1/1000    999/1000

Thus, the expected gain is:

$$E(X) = \sum X \cdot P(X) = 349 \cdot \frac{1}{1000} + (-1) \cdot \frac{999}{1000} = \frac{349}{1000} - \frac{999}{1000} = -0.65.$$

Hence, the person loses $0.65, on average, for each ticket purchased.
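
Both expected-gain calculations (Examples 6−8 and 6−9) follow the same pattern: list each possible gain
with its probability, multiply, and add. The following Python sketch illustrates this; the names
`expected_value`, `coin_game` and `raffle` are introduced here for illustration only.

```python
from fractions import Fraction


def expected_value(distribution):
    """Expected value of a discrete distribution given as {gain: probability}."""
    return sum(gain * prob for gain, prob in distribution.items())


# Example 6-8: win $5 with probability 2/8, lose $3 with probability 6/8.
coin_game = {5: Fraction(2, 8), -3: Fraction(6, 8)}
# Example 6-9: net gain $349 with probability 1/1000, otherwise lose the $1 ticket.
raffle = {349: Fraction(1, 1000), -1: Fraction(999, 1000)}

print(expected_value(coin_game))      # expected gain per play
print(float(expected_value(raffle)))  # expected gain per ticket
```

A negative expected value means that the player loses money on average.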

6.5 The Binomial Distribution


Many problems in probability have only two outcomes or can be reduced to two outcomes. For example,
when a coin is tossed, the outcome can be a ‘head’ or a ‘tail’. When a baby is born, it will be either ‘male’
or ‘female’. In a rugby game, a team either ‘wins’ or ‘loses’. When a projectile is fired at a target, the
outcome is either ‘hit the target’ or ‘miss the target’. In other situations, the outcomes can be reduced
to two. For example, the answer to a multiple-choice question, even though there are four or five answer
choices, can be classified as ‘correct’ or ‘incorrect’. Situations like these are called binomial experiments.

6.5.1 Requirements of Binomial Experiments


A binomial experiment is a probability experiment that must satisfy the following four conditions:
1. Each trial can have only two outcomes or outcomes that can be reduced to two outcomes. These
outcomes can be considered as either a success or failure.
2. There must be a fixed number of trials (say, n trials). All trials are identical.
3. The outcomes of each trial must be independent of each other.
4. The probability of a success (p) must remain the same for each trial.



EXAMPLE 6−10

Consider the experiment consisting of tossing a coin three times. Determine whether or not it is a binomial
experiment.

SOLUTION

The experiment satisfies all four conditions as follows:


i. Each toss (trial) has only two outcomes: a head or tail.
ii. The experiment has n = 3 fixed number of trials and they are all identical.
iii. The outcomes of each toss are independent of each other as the result of a succeeding toss is
not affected by the result of its preceding toss.
iv. The probability of obtaining a head (a success) is 1/2 and of a tail (a failure) is 1/2 for any toss.
That is,
p = P(H) = 1/2 and q = P(T) = 1/2.
The probability of a success is the same, 1/2, for each trial.

6.5.2 Binomial Probability Formula


For a binomial experiment, the probability of exactly X successes in n trials is given by the following
formula:

$$P(X) = {}^{n}C_{X} \cdot p^{X} \cdot q^{n-X},$$
where
n: Number of trials
p: Probability of success in a trial
q: Probability of failure in a trial
X: Number of successes in n trials
Note:
1. $p + q = 1$.
2. $X = 0, 1, 2, \ldots, n$.
3. The binomial distribution is a discrete distribution.
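
The binomial probability formula translates almost symbol for symbol into code. The sketch below is an
illustration only; the function name `binomial_pmf` is our own, and `math.comb(n, x)` plays the role of
the combination count in the formula.

```python
from math import comb


def binomial_pmf(x, n, p):
    """P(X = x) for a binomial experiment with n trials and success probability p."""
    q = 1 - p  # probability of failure, since p + q = 1
    return comb(n, x) * p**x * q**(n - x)


# Three tosses of a fair coin: probability of exactly two heads.
print(binomial_pmf(2, 3, 0.5))  # 0.375
```

Summing `binomial_pmf(x, n, p)` over x = 0, 1, ..., n gives 1, as required of any probability distribution.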

EXAMPLE 6−11

A coin is tossed three times. Find the probability of getting exactly two heads.
SOLUTION

Let X be the number of heads, with n = 3, X = 2, p = 1/2, and q = 1/2. Then

$$P(2) = {}^{3}C_{2} \left(\frac{1}{2}\right)^{2} \left(\frac{1}{2}\right)^{3-2} = 0.375.$$



EXAMPLE 6−12

If a student randomly guesses at five multiple-choice questions, find the probability that the student gets
exactly three correct. Each question has five possible choices.
SOLUTION

Let X be the number of correct answers. In this case: n = 5, X = 3, p = 1/5, and q = 4/5.

Therefore,

$$P(3) = {}^{5}C_{3} \left(\frac{1}{5}\right)^{3} \left(\frac{4}{5}\right)^{5-3} = 0.0512 \approx 0.05.$$

EXAMPLE 6−13

A survey from Teenage Research Unlimited found that 30% of teenage consumers receive their spending
money from part-time jobs. If five are selected at random, find the probability that at least three of them
receive their spending money from part-time jobs.
SOLUTION

Let X be the number of consumers who receive their spending money from part-time jobs. In this case:
n = 5, X = 3, 4, or 5, p = 0.3, and q = 0.7. Therefore,
$P(3) = {}^{5}C_{3}(0.3)^{3}(0.7)^{2} = 0.132$
$P(4) = {}^{5}C_{4}(0.3)^{4}(0.7)^{1} = 0.028$
$P(5) = {}^{5}C_{5}(0.3)^{5}(0.7)^{0} = 0.002$

Hence,
P (X ≥ 3) = P(3) + P(4) + P(5)
= 0.132 + 0.028 + 0.002 = 0.162.
The above example indicates that the binomial probability formula can be tedious at times. Therefore,
binomial tables have been developed for selected values of n and p to overcome this tiresome task.
Please refer to the Eton statistical tables.
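
Besides tables, a short program can take over the repeated use of the formula. The sketch below, written
under the assumptions of Example 6−13 (n = 5, p = 0.3), adds the individual binomial probabilities for an
"at least" question. The helper names `binomial_pmf` and `prob_at_least` are hypothetical.

```python
from math import comb


def binomial_pmf(x, n, p):
    """P(X = x) for n trials with success probability p."""
    return comb(n, x) * p**x * (1 - p)**(n - x)


def prob_at_least(k, n, p):
    """P(X >= k): add the probabilities of k, k + 1, ..., n successes."""
    return sum(binomial_pmf(x, n, p) for x in range(k, n + 1))


at_least_three = prob_at_least(3, 5, 0.3)
print(round(at_least_three, 3))
```

Because the program does not round the individual terms, its answer (about 0.163) can differ in the last
decimal place from a hand calculation that adds three rounded values.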

EXAMPLE 6−14

If 30% of the people in a community use the library in one year, for a sample of 15 people find
the probability that:
A. Exactly 7 used the library.
B. At least 5 used the library.

SOLUTION

Using the binomial formula, especially in part B, would be very time consuming, so we make use of the
binomial tables.

A. With n = 15, p = 0.3 and X = 7, we get P (7) = 0.0811.

B. P (X ≥ 5) = 1 − P(X ≤ 4)
= 1 − (0.0047 + 0.0305 + 0.0916 + 0.1700 + 0.2186)
= 0.4846.

EXAMPLE 6−15

The probability that a patient will die whilst having a particular type of heart operation is 0.40. If 10 patients
decided to have this particular type of heart operation, what is the probability that:
A. 2 will die,
B. At most 3 will die,
C. At least 5 will die

SOLUTION

Let X = the number of patients who die.

Here, n = the number of patients who decided to have the operation = 10, and
p = the probability that a patient will die = 40% = 0.4.

Using the binomial table in the Eton tables:

A. P(2 will die) = P(2) = 0.1209.
B. P(at most 3 will die) = P(X ≤ 3)
= P (0) + P (1) + P (2) + P (3)
= 0.0060 + 0.0403 + 0.1209 + 0.2150 = 0.3822.
C. P(at least 5 will die) = P(X ≥ 5)
= P (5) + P (6) + P (7) + P (8) + P (9) + P (10)
= 0.2007 + 0.1115 + 0.0425 + 0.0106 + 0.0016 + 0.0001 = 0.3670.
Alternatively,
P (at least 5 will die) = P(X ≥ 5)
= 1 − P(X ≤ 4)
= 1 − [P(X = 0) + P(X = 1) + P(X = 2) + P(X = 3) + P(X = 4)]
= 1 − [0.0060 + 0.0403 + 0.1209 + 0.2150 + 0.2508] = 0.3670.



EXAMPLE 6−16

It is found that 75% of the patients suffering from a particular disease are cured successfully. What is the
probability that 3 of the next 4 patients will be cured successfully?

SOLUTION

Let X = the no. of patients cured successfully.


Here n = 4, p = 0.75, and X = 3.

We cannot use the tables directly, since the probability 0.75 is not listed in the tables. To use the
table, we count failures instead of successes: let Y = n − X be the number of patients who are not cured,
and use the probability 1 − p = 0.25 in place of p. Therefore, look up the table with n = 4, Y = 1 and
probability 0.25. We get P (X = 3) = P (Y = 1) = 0.4219.
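
The switch from counting successes to counting failures can be verified numerically. The sketch below
evaluates the binomial formula both ways for the values of Example 6−16; the function name
`binomial_pmf` is our own.

```python
from math import comb, isclose


def binomial_pmf(x, n, p):
    """P(X = x) for n trials with success probability p."""
    return comb(n, x) * p**x * (1 - p)**(n - x)


n, p, x = 4, 0.75, 3
direct = binomial_pmf(x, n, p)              # 3 cured, each with probability 0.75
via_failures = binomial_pmf(n - x, n, 1 - p)  # 1 not cured, each with probability 0.25

print(round(direct, 4), round(via_failures, 4))
print(isclose(direct, via_failures))
```

Both calls describe the same event, so they return the same probability.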

6.5.3 Mean, Variance and Standard Deviation of the Binomial Distribution


The mean, variance, and standard deviation of a variable that has the binomial distribution can be found
by using the following formulas.

Mean: $\mu = np$
Variance: $\sigma^2 = npq$
Standard deviation: $\sigma = \sqrt{npq}$

EXAMPLE 6−17

A coin is tossed 4 times. Find the mean, variance, and standard deviation of the number of heads that
will be obtained.

SOLUTION
Here n = 4, p = 1/2, and q = 1/2. Using the formulas, we have

$$\mu = n \cdot p = 4 \cdot (1/2) = 2$$
$$\sigma^2 = n \cdot p \cdot q = 4 \cdot (1/2) \cdot (1/2) = 1$$
$$\sigma = \sqrt{1} = 1$$

Alternatively, this problem can be solved using expected value formula.

X       0      1      2      3      4
P(X)    1/16   4/16   6/16   4/16   1/16

$$\mu = E(X) = \sum_{x} x \, P(X = x) = 0\left(\frac{1}{16}\right) + 1\left(\frac{4}{16}\right) + \cdots + 4\left(\frac{1}{16}\right) = 2.$$

$$\sigma^2 = \sum_{x} x^2 \, P(X = x) - \mu^2 = 0^2\left(\frac{1}{16}\right) + 1^2\left(\frac{4}{16}\right) + \cdots + 4^2\left(\frac{1}{16}\right) - 2^2 = 1$$

and $\sigma = \sqrt{1} = 1$.
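
The agreement between the two methods of Example 6−17 can be confirmed numerically. The Python sketch
below is an illustration only; all variable names are our own.

```python
from math import comb, isclose, sqrt

n, p = 4, 0.5
q = 1 - p

# Method 1: the binomial shortcut formulas.
mean_formula = n * p
var_formula = n * p * q

# Method 2: build the full distribution and use the expected value formulas.
pmf = {x: comb(n, x) * p**x * q**(n - x) for x in range(n + 1)}
mean_sum = sum(x * prob for x, prob in pmf.items())
var_sum = sum(x**2 * prob for x, prob in pmf.items()) - mean_sum**2

print(mean_formula, var_formula, sqrt(var_formula))
print(mean_sum, var_sum, sqrt(var_sum))
```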

6.6 Summary
In this chapter, we examined random variables and discrete probability distributions. The concepts
discussed in this chapter were: random variables; discrete probability distributions; and the mean,
variance and standard deviation of a discrete probability distribution. We also discussed a common
discrete distribution, the binomial distribution, and used it to solve some probability problems.



EXERCISES

1. The probability function of a discrete random variable Y is given by $P(Y = y) = cy^2$ for y = 0,
1, 2, 3, 4. Given that c is a constant, find the value of c.

2. The following is the probability distribution of X, the number of breakdowns per week for a machine,
based on past data.

X 0 1 2 3
P (X) 0.15 0.20 0.35 0.3

Find the probability that the no. of breakdowns for this machine during a given week is:
A. Exactly 2
B. At least 2
C. At most 1

3. Find the mean, variance and standard deviation of the probability distribution in question 2.

4. According to an internet posting, 80% of adults enjoy drinking beer. Three adults are randomly
selected. Let X be the number of these adults who enjoy drinking beer:
A. Obtain the probability distribution of X.
B. Calculate the expected value and standard deviation of X.

5. Joe is playing a game of chance at the Hibiscus festival, costing $1 for each game. In the game two
fair dice are rolled and the sum of the numbers that turn up is found. If the sum is seven, then Joe
wins $5; otherwise, Joe loses his money. Joe plays the game 15 times. Find his expected gain or loss.

6. Eight people applied for a job as assistant manager of restaurants. Five have completed college and
three have not. If a manager selects three applicants at random, construct a probability distribution
for selecting those that have completed college.

7. A shoe store’s records show that 30% of customers making a purchase use a credit card to make
payment. This morning, 20 customers purchased shoes from the store. Find the probability that at
least 2 of the customers used a credit card. (Assume independence).

8. The editor of a journal historically accepts 11% of articles submitted for publication. Using the
binomial formula, find the probability that in a random sample of 8 articles submitted to this journal,
the editor will accept:
A. Exactly 4 for publication.
B. At least one for publication.

9. If 3% of calculators are defective, find the mean, variance and standard deviation of a lot of 400
calculators.

10. A fisherman finds that approximately 17% of all his fish go bad by the time he takes them to the
market. The fisherman catches 1,000 fish.
A. How many will go bad by the time he takes them to the market?
B. Find the standard deviation.



CHAPTER 7:

THE NORMAL DISTRIBUTION

Chapter 7: The Normal Distribution 105


Overview
This chapter discusses the normal distribution. The concepts discussed in this chapter are as follows: the
normal distribution; standard normal distribution; applications of normal distribution; the central limit
theorem. The chapter concludes with a summary and a set of exercises.

Objectives
After completing this chapter, you should be able to:
1. List the properties of a normal distribution.
2. Find the area under the standard normal distribution given the z-values.
3. Find the probabilities for a normally distributed random variable.
4. Find specific data values for given percentages, using the standard normal distribution.
5. Use the central limit theorem to solve problems involving sample means for large samples.

7.1 Introduction
Random variables can be either discrete or continuous. Discrete random variables and their distributions
were discussed in Chapter 6, where we also examined the binomial distribution and its properties. Recall
that discrete random variables are those whose values are countable. A continuous random
variable, on the other hand, can assume all values in an interval. Examples of continuous variables are
heights of students, body temperature of dogs and blood pressure of adults. For example, if a bulb lasts
at most 1 year, its life can be any value in the interval from 0 to 1 year, and this interval contains
infinitely many values that cannot be counted.

Many continuous random variables have distributions that are bell-shaped and are called approximately
normally distributed variables. In this chapter, we will study a special continuous distribution called the
normal distribution. Finally, this chapter also explains a very important result related to the normal
distribution, called the central limit theorem.

7.2 The Normal Distribution


The normal distribution is also known as a bell curve or a Gaussian distribution, named for the German
mathematician Carl Friedrich Gauss (1777–1855), who derived its equation.

No variable fits a normal distribution perfectly, since a normal distribution is a theoretical distribution.
However, a normal distribution can be used to describe many variables that are approximately normal.

When the data values are evenly distributed about the mean, the distribution is said to be symmetric
(the normal distribution is symmetric). When the majority of the data lies to the left or right of the
mean, the distribution is said to be skewed.



A normal distribution is a continuous, symmetric, bell-shaped distribution of a variable (Quora, 2016).

7.2.1 Properties of Normal Distribution


1. Normal distribution curves are bell-shaped, continuous and symmetric about the mean.
2. The mean, median and mode are equal and located at the center of the distribution.
3. The normal distribution curve is unimodal.
4. The curve never touches the x-axis.
5. The total area between the normal curve and the x-axis is always equal to 1 square unit.
6. The area under the normal curve that lies within 1 standard deviation of the mean is about 68%,
within 2 standard deviations of the mean is about 95% and within 3 standard deviations of the mean
is about 99.7%. This is called the empirical rule.



7.2.2 Standard Normal Distribution
Finding the area under a normal curve is difficult, so statisticians use a standard normal curve to find this
area.

A standard normal distribution is a normal distribution with a mean of 0 and a standard deviation of 1.

A normally distributed variable X can be transformed into the standard normally distributed variable z
by using the formula for the z-score:

$$z = \frac{X - \mu}{\sigma},$$

where
X = data value
$\mu$ = population mean
$\sigma$ = population standard deviation

Steps to find Area under the Standard Normal Curve


Step 1: Draw the standard normal curve and shade the area desired.

Step 2: Use the Eton table to find the area.

EXAMPLE 7−1

Find the area on the left of z = 1.99.

SOLUTION

Step 1: Draw a standard normal curve and shade the area on the left of 1.99.

Step 2: Look for z = 1.99 in the Eton table and we get 0.4767. The area 0.4767 obtained from the table
is the area under the curve from 0 to 1.99. Since the area on the left of 0 is 0.5, the area desired is 0.5 +
0.4767 = 0.9767. The area on the left of 1.99 can also be written as P (z <1.99) = 0.9767 and is read as
probability that z is less than 1.99 is 0.9767 or 97.67%.
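
Table look-ups of this kind can be cross-checked with Python's standard library (version 3.8 or later),
whose `statistics.NormalDist` class provides the cumulative area to the left of a value. The sketch below
assumes, as the worked example suggests, that the table lists the area between 0 and z; the helper names
`area_zero_to` and `area_left_of` are our own.

```python
from statistics import NormalDist

Z = NormalDist(mu=0, sigma=1)  # the standard normal distribution


def area_zero_to(z):
    """Area under the curve between 0 and |z| (the quantity the table is read for)."""
    return Z.cdf(abs(z)) - 0.5


def area_left_of(z):
    """Area under the curve on the left of z, i.e. P(Z < z)."""
    return Z.cdf(z)


print(round(area_zero_to(1.99), 4))  # table value for z = 1.99
print(round(area_left_of(1.99), 4))  # 0.5 plus the table value
```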



EXAMPLE 7−2

Find the area on the right of z = −1.16.

SOLUTION

Step 1: Draw a standard normal curve and shade the area on the right of −1.16.

Step 2: Since negative z values are not listed in the Eton table, look for z = 1.16 and we get 0.3770.
By symmetry, the area 0.3770 obtained from the table is the area under the curve from −1.16 to 0. Since
the area on the right of 0 is 0.5, the area desired is 0.5 + 0.3770 = 0.8770. P (z > −1.16) = 0.8770.

EXAMPLE 7−3

Find the area between z = −1.37 and z = 1.68.

SOLUTION

Step 1: Draw a standard normal curve and shade the area between z = −1.37 and z = 1.68.

Step 2: Look for z = 1.37 and we get 0.4147. The area 0.4147 obtained from the table is the area under
the curve from −1.37 to 0. Then look for z = 1.68 and we get 0.4535. The area 0.4535 obtained from the
table is the area under the curve from 0 to 1.68. So the area desired is 0.4147 + 0.4535 = 0.8682.
P (−1.37 < z < 1.68) = 0.8682.



EXAMPLE 7−4

Find the probability P (z > 1.91).

SOLUTION

Step 1: Draw a standard normal curve and shade the area on the right of z = 1.91.

Step 2: Look for z = 1.91 and we get 0.4719. The area 0.4719 obtained from the table is the area under
the curve from 0 to 1.91. Since the area on the right of 0 is 0.5, the area desired is 0.5 − 0.4719 =
0.0281. P (z > 1.91) = 0.0281.
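
Areas between two z values and areas in the right tail can be obtained from the cumulative area in the
same way as in the hand calculations. The following sketch uses `statistics.NormalDist` from the Python
standard library; the helper names `area_between` and `area_right_of` are hypothetical.

```python
from statistics import NormalDist

Z = NormalDist()  # mean 0, standard deviation 1


def area_between(z_low, z_high):
    """P(z_low < Z < z_high): difference of two left-hand areas."""
    return Z.cdf(z_high) - Z.cdf(z_low)


def area_right_of(z):
    """P(Z > z): total area 1 minus the area on the left of z."""
    return 1 - Z.cdf(z)


print(round(area_between(-1.37, 1.68), 4))  # Example 7-3
print(round(area_right_of(1.91), 4))        # Example 7-4
```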

EXAMPLE 7−5

Find the z value such that the area under the standard normal curve between 0 and the z value is 0.2157.

SOLUTION

Step 1: Draw a standard normal curve and shade the area between 0 and the z value to be 0.2157.

[Figure: standard normal curve with an area of 0.2157 shaded between z = 0 and the unknown z value]

Step 2: Since the area between 0 and the z value is 0.2157, then look for 0.2157 in the probability
section of the table. The z value corresponding to 0.2157 is 0.57. Therefore, the z value is 0.57. See the
diagram below.



EXAMPLE 7−6

Find the z value such that the area under the standard normal curve on the right of the z value is 0.0239.

SOLUTION

Step 1: Draw a standard normal curve and shade the area on the right of the z value to be 0.0239.

[Figure: standard normal curve with an area of 0.0239 shaded on the right of the unknown z value]

Step 2: Find the area between 0 and the z value, which will be 0.5 − 0.0239 = 0.4761. Then look for
0.4761 in the probability section of the table. The z value corresponding to 0.4761 is 1.98. Therefore,
the z value is 1.98.
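
Reading the table "backwards" (from an area to a z value) corresponds to the inverse of the cumulative
area function. In Python this is available as `NormalDist.inv_cdf`. The sketch below is an illustration
under that assumption; the helper names are our own.

```python
from statistics import NormalDist

Z = NormalDist()


def z_from_center_area(area):
    """z > 0 such that the area between 0 and z equals the given area."""
    return Z.inv_cdf(0.5 + area)


def z_from_right_area(area):
    """z such that the area on the right of z equals the given area."""
    return Z.inv_cdf(1 - area)


print(round(z_from_center_area(0.2157), 2))  # Example 7-5
print(round(z_from_right_area(0.0239), 2))   # Example 7-6
```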

7.3 Applications of Normal Distribution


The standard normal distribution curve can be used to solve a wide variety of practical problems if the
variable is approximately normally distributed. There are various ways we can check whether the variable
is approximately normally distributed or not.

To solve the application problems, we need to know how to find the probability given the z value or find
z value given the probability.



EXAMPLE 7−7

The average annual salary for all U.S. teachers is $47750. Assume that the salaries are normally
distributed and the standard deviation is $5680. Find the probability that a randomly selected teacher
earns:
A. Between $35000 and $45000 a year.
B. More than $40000 a year.

SOLUTION

Let, X = annual salary of a teacher which is normally distributed with µ = 47750 and σ = 5680.

A. This probability can be written as P (35000 < X < 45000). There are two X values here, 35000 and
45000. Now convert the two X values into z using the formula:

For X = 35000, $z = \frac{35000 - 47750}{5680} = -2.24$.
For X = 45000, $z = \frac{45000 - 47750}{5680} = -0.48$.

So P (35000 < X < 45000) = P (−2.24 < z < −0.48). Now draw a standard normal curve and shade
the area between −2.24 and −0.48.

Look for z = 2.24 in the tables and we get 0.4875. The area 0.4875 obtained from the table is the
area under the curve from −2.24 to 0. Now look for z = 0.48 and we get 0.1844. The area 0.1844
obtained from the table is the area under the curve from −0.48 to 0. So the area desired is 0.4875 −
0.1844 = 0.3031. Therefore, P (35000 < X < 45000) = 0.3031 or 30.31%.

B. This probability can be written as P (X > 40000). Now convert the X value into z using the formula:

For X = 40000, $z = \frac{40000 - 47750}{5680} = -1.36$.

So P (X > 40000) = P (z > −1.36). Now draw a standard normal curve and shade the area on the
right of −1.36.

Look for z = 1.36 and we get 0.4131. The area 0.4131 obtained from the table is the area under the
curve from −1.36 to 0. So the area desired is 0.5 + 0.4131 = 0.9131. Therefore, P (X > 40000) = 0.9131
or 91.31%.
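
The same probabilities can be computed without converting to z by hand, because `statistics.NormalDist`
accepts the mean and standard deviation directly. This is a minimal sketch using the values of
Example 7−7; the variable names are our own.

```python
from statistics import NormalDist

salary = NormalDist(mu=47750, sigma=5680)

# A. Between $35000 and $45000: difference of two left-hand areas.
p_between = salary.cdf(45000) - salary.cdf(35000)
# B. More than $40000: 1 minus the area on the left of 40000.
p_above = 1 - salary.cdf(40000)

print(round(p_between, 4), round(p_above, 4))
```

The program does not round the z values to two decimal places, so its answers can differ from the table
answers in the third decimal place.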

EXAMPLE 7−8

A certain type of storage battery lasts, on the average, 3.0 years with a standard deviation of 0.5 years.
Assuming that the battery lives are normally distributed, find the probability that a given battery will last
less than 2.3 years.

SOLUTION

Let X = the number of years a battery lasts, which is normally distributed with µ = 3.0 and σ = 0.5. This
probability can be written as P (X < 2.3). Now convert the X value into z using the formula:

For X = 2.3, $z = \frac{2.3 - 3}{0.5} = -1.4$.

So P (X < 2.3) = P (z < −1.4). Now draw a standard normal curve and shade the area on the left of −1.4.

Look for z = 1.4 and we get 0.4192. So the area desired is 0.5 − 0.4192 = 0.0808. Therefore, the
probability that a given battery will last less than 2.3 years is 0.0808 or 8.08%.

EXAMPLE 7−9

An electrical firm manufactures light bulbs that have a length of life that is normally distributed with mean
equal to 800 hours and a standard deviation of 40 hours. Find the probability that a bulb burns between
778 and 834 hours.

SOLUTION

Let X = length of life of a bulb, which is normally distributed with µ = 800 and σ = 40. This probability
can be written as P (778 < X < 834). Converting the X values into z we get (778 − 800)/40 = −0.55 and
(834 − 800)/40 = 0.85. So P (778 < X < 834) = P (−0.55 < z < 0.85). The area desired is the area
between z = −0.55 and z = 0.85.

Therefore, P (778 < X < 834) = 0.2088 + 0.3023 = 0.5111.



EXAMPLE 7−10

The time taken by the milkman to deliver to the High Street is normally distributed with a mean of 12
minutes and a standard deviation of 2 minutes. He delivers milk every day. Estimate the number of days
during the year when he takes:
A. Longer than 17 minutes.
B. Less than 10 minutes.

SOLUTION

Let X be the time, in minutes, taken to deliver milk to the High Street, which is normally distributed with
µ = 12 and σ = 2.

A. We have to find P (X > 17). Converting 17 into a z value we get (17 − 12)/2 = 2.5. So
P (X > 17) = P (z > 2.5), the area on the right of z = 2.5.
P (X > 17) = 0.5 − 0.4938 = 0.0062. To find the number of days, multiply by 365:
365 × 0.0062 = 2.263 ≈ 2. Therefore, on about two days in a year he takes longer than 17 minutes.
B. We have to find P (X < 10). Converting 10 into a z value we get (10 − 12)/2 = −1. So
P (X < 10) = P (z < −1), the area on the left of z = −1.
P (X < 10) = 0.5 − 0.3413 = 0.1587. Now 365 × 0.1587 = 57.93 ≈ 58.
Therefore, on about 58 days in a year he takes less than 10 minutes.



EXAMPLE 7−11

The scores on an IQ test are normally distributed with a mean of 400 and a standard deviation of 100.
The top 3% of students receive $500 as prize money. What is the minimum score one would need to
receive this award?

SOLUTION

Step 1: Draw a standard normal curve and shade the area on the right of the z value to be 0.03.

Step 2: Find the area between 0 and the z value, which will be 0.5 − 0.03 = 0.47. Then look for 0.47 in
the probability section of the table. We do not have 0.47, so use the closest value, which is 0.4699. The
z value corresponding to 0.4699 is 1.88. Now use the z formula to find the value of X:

$$1.88 = \frac{X - 400}{100}.$$

Making X the subject, we get X = 400 + 1.88(100) = 588. Thus, a student must score at least 588 to
receive the award.

EXAMPLE 7−12

For a medical study, a researcher wishes to select people in the middle 60% of the population based on
blood pressure. If the systolic blood pressure is normally distributed with the mean of 120 and the
standard deviation is 8, find the upper and lower readings that would qualify people to participate in the
study.

SOLUTION

Step 1: Draw a standard normal curve and shade the middle area to be 60%, between z0 (on the left of 0)
and z1 (on the right of 0).

Step 2: Find the area between 0 and the z1 value, which will be 0.3. Then look for 0.3 in the probability
section of the table. We do not have 0.3, so use the closest value, which is 0.2996. The z value
corresponding to 0.2996 is 0.84. So z1 = 0.84 and z0 = −0.84 because the graph is symmetric. Now use
the z formula to find the values of X; since there are two values of z, there will be two values of X:

$$\pm 0.84 = \frac{X - 120}{8}, \quad \text{so} \quad X = 120 \pm 0.84(8) = 120 \pm 6.72.$$

Therefore, the two values of X are 113.28 and 126.72. Thus, the lower reading is 113.28 and the upper
reading is 126.72.

EXAMPLE 7−13

The weights of boxes of oranges are normally distributed such that 30% of them are greater than 4kg
and 20% are greater than 4.53kg. Estimate the mean and standard deviation of the weights.

SOLUTION

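We are given that P(X > 4) = 0.3 and P(X > 4.53) = 0.2. Using these we have to find the values of µ and
σ. Let us first consider P(X > 4) = 0.3. Converting 4 into a z value we get (4 − µ)/σ, so we have

$$P(X > 4) = P\left(z > \frac{4 - \mu}{\sigma}\right) = 0.3.$$

Now draw a standard normal curve and shade the area to the right of (4 − µ)/σ so that it equals 0.3.

[Figure: standard normal curve with an area of 0.3 shaded to the right of (4 − µ)/σ]

The area between 0 and this z value is 0.5 − 0.3 = 0.2. Using the tables we obtain the z value to be 0.52.
Therefore, we have the equation

$$0.52 = \frac{4 - \mu}{\sigma}. \qquad (1)$$

Similarly, using P(X > 4.53) = 0.2, the area between 0 and the z value is 0.3, which gives z = 0.84 and a
second equation

$$0.84 = \frac{4.53 - \mu}{\sigma}. \qquad (2)$$

Solving equations (1) and (2) simultaneously: subtracting (1) from (2) gives 0.32σ = 0.53, so σ ≈ 1.66 kg,
and substituting into (1) gives µ = 4 − 0.52(1.66) ≈ 3.14 kg.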
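
The elimination step for the two equations can be sketched in a few lines of Python. This is only an illustration under the assumption that the table values z = 0.52 and z = 0.84 are used; the variable names are hypothetical. The results are sensitive to how the z values are rounded, so slightly different table readings shift µ and σ in the second decimal place.

```python
# z values read from the standard normal table, paired with the given weights
z1, x1 = 0.52, 4.0     # from P(X > 4)    = 0.3
z2, x2 = 0.84, 4.53    # from P(X > 4.53) = 0.2

# From z = (x - mu) / sigma we get x1 = mu + z1*sigma and x2 = mu + z2*sigma.
# Subtracting the first from the second eliminates mu.
sigma = (x2 - x1) / (z2 - z1)

# Substitute sigma back into the first equation to recover mu.
mu = x1 - z1 * sigma

print(round(mu, 2), round(sigma, 2))
```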



7.4 The Central Limit Theorem
We have discussed probability distributions and now we will extend the concept to that of a sampling
distribution. Before we discuss sampling distributions, let us recall the terms statistic and parameter.
A statistic is a numerical measure computed from sample data, for example the sample mean and the
sample standard deviation. On the other hand, the same numerical measures computed from population
data are called parameters. A statistic is a random variable and therefore it has a probability
distribution. The probability distribution of a statistic is commonly called its sampling
distribution. In this section, we will discuss the sampling distribution of the sample mean.

7.4.1 The sampling distribution of Sample Mean (X̄)

A sampling distribution of sample means is a distribution using the means computed from
all possible random samples of a specific size taken from a population.

If the samples are randomly selected with replacement, the sample means will be somewhat different
from the population mean. These differences are caused by sampling error.

Sampling error is the difference between the sample measure and the corresponding
population measure due to the fact that the sample is not a perfect representation of the
population.

7.4.2 Properties of the sampling distribution of the Sample Mean


1. The mean of the sample means, denoted by µ_X̄, will be the same as the population mean, that is:

$$\mu_{\bar{X}} = \mu.$$

2. The standard deviation of the sample means, denoted by σ_X̄, will be equal to the population
standard deviation divided by the square root of the sample size, that is:

$$\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}}.$$

The following example illustrates these two properties. Suppose a lecturer gave an 8-point quiz to a small
class of four students. The results of the quiz were 2, 6, 4, and 8. For the sake of discussion, assume
that the four students constitute the population. The mean of the population is µ = 5 and the standard
deviation of the population σ = 2.236.

Now, if all samples of size 2 are taken with replacement and the mean of each sample is found, the
distribution is as shown.



Sample    Mean (X̄)    Sample    Mean (X̄)
2, 2      2            6, 2      4
2, 4      3            6, 4      5
2, 6      4            6, 6      6
2, 8      5            6, 8      7
4, 2      3            8, 2      5
4, 4      4            8, 4      6
4, 6      5            8, 6      7
4, 8      6            8, 8      8

Using the table above, the mean of the values in the 2nd and the 4th columns is µ_X̄ = 5. This
is the same as the population mean, hence µ_X̄ = µ.
For the standard deviation of the sample means, we find the standard deviation of the values in the 2nd
and the 4th columns, which gives σ_X̄ = 1.581. This is the same as the population standard deviation
divided by √2, that is, 2.236/√2 = 1.581.
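
The two properties can be checked by enumerating every sample directly. The sketch below is an illustration only, assuming Python 3 with its standard library; the variable names are hypothetical. It lists all 16 ordered samples of size 2 drawn with replacement from the quiz scores and compares the mean and standard deviation of the sample means with µ and σ/√n.

```python
from itertools import product
from math import sqrt
from statistics import mean, pstdev

population = [2, 6, 4, 8]   # quiz scores treated as the whole population
n = 2                       # sample size

# All 16 ordered samples of size 2 drawn with replacement, reduced to their means
sample_means = [mean(sample) for sample in product(population, repeat=n)]

mu = mean(population)
sigma = pstdev(population)           # population standard deviation

mu_xbar = mean(sample_means)         # mean of the sample means
sigma_xbar = pstdev(sample_means)    # standard deviation of the sample means

print(mu, mu_xbar)
print(round(sigma_xbar, 3), round(sigma / sqrt(n), 3))
```

Note that `pstdev` (which divides by the number of values) is used for both calculations, because the four scores and the 16 sample means are each treated as complete populations here.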

The third property of the sampling distribution of sample means is on the shape of the distribution and is
explained by the central limit theorem.

7.4.3 The Central Limit Theorem


As the sample size n increases without limit, the shape of the distribution of the sample means taken
with replacement from a population with mean µ and standard deviation σ will approach a normal
distribution with a mean of µ and a standard deviation of σ/√n.


If the sample size is sufficiently large, the central limit theorem can be used to answer questions about
sample means in the same manner that a normal distribution can be used to answer questions about
individual values. The only difference is that a new formula must be used for the z values. It is:

$$z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}}.$$

EXAMPLE 7−14

The average teacher’s salary in Fiji is $29,863. Suppose that the distribution is normal with standard
deviation of $5100.
A. What is the probability that a randomly selected teacher’s salary is less than $40,000?
B. What is the probability that the mean salary for a sample of 80 teachers is greater than $30,000?



SOLUTION

Let X be the salary of a teacher, which is normally distributed with µ = 29863 and σ = 5100.
A. We have to find P(X < 40,000). Converting 40,000 into a z value we get

$$z = \frac{40000 - 29863}{5100} = 1.99.$$

So P(X < 40,000) = P(z < 1.99). The area desired is:

[Figure: standard normal curve shaded to the left of z = 1.99]

So P(X < 40,000) = 0.5 + 0.4767 = 0.9767.

B. We have to find P(X̄ > 30,000). Since the variable X is normally distributed, the sample mean
X̄ will have a normal distribution.

The z value of X̄ = 30,000 is

$$z = \frac{30000 - 29863}{5100/\sqrt{80}} = 0.24.$$

So P(X̄ > 30,000) = P(z > 0.24). The area desired is:

[Figure: standard normal curve shaded to the right of z = 0.24]

So P(X̄ > 30,000) = 0.5 − 0.0948 = 0.4052.
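
The difference between parts A and B is only the standard deviation used: σ for an individual value and σ/√n for a sample mean. A minimal sketch of both calculations follows, assuming Python 3.8 or later and the standard-library `statistics.NormalDist`; the variable names are illustrative. Since no z value is rounded to two decimals, the results can differ from the table-based answers in the fourth decimal place.

```python
from math import sqrt
from statistics import NormalDist

mu, sigma = 29863, 5100   # population mean and standard deviation of salaries

# A. One randomly selected salary: X is normal with standard deviation sigma
p_single = NormalDist(mu, sigma).cdf(40000)

# B. Mean of n = 80 salaries: X-bar is normal with standard deviation sigma / sqrt(n)
n = 80
p_mean = 1 - NormalDist(mu, sigma / sqrt(n)).cdf(30000)

print(round(p_single, 4), round(p_mean, 4))
```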

EXAMPLE 7−15

It is reported that children between 2 and 5 years old watch an average of 25 hours of TV per week.
Assume the variable is normally distributed and the standard deviation is 3 hours. If 20 children between
the ages of 2 and 5 are randomly selected, find the probability that the mean of the number of hours
they watch TV will be greater than 26.3 hours.

SOLUTION

We have to find P(X̄ > 26.3). Since the variable X is normally distributed, the sample mean X̄ will
have a normal distribution.

The z value of X̄ = 26.3 is

$$z = \frac{26.3 - 25}{3/\sqrt{20}} = 1.94.$$

Hence, P(X̄ > 26.3) = P(z > 1.94) = 0.5 − 0.4738 = 0.0262.

Therefore, the probability that the mean of the number of hours they watch TV will be greater than 26.3
hours is 0.0262.

EXAMPLE 7−16

The average age of a vehicle registered in Fiji is 8 years, or 96 months. Assume the standard deviation
is 16 months. If a random sample of 36 vehicles is selected, find the probability that the mean of their age
is between 90 and 100 months.

SOLUTION
We have to find P(90 < X̄ < 100). The variable X is not known to be normally distributed, but since the
sample size is more than 30, the sample mean X̄ will have an approximately normal distribution.
The z value of X̄ = 90 is

$$z = \frac{90 - 96}{16/\sqrt{36}} = -2.25.$$

The z value of X̄ = 100 is

$$z = \frac{100 - 96}{16/\sqrt{36}} = 1.5.$$

Hence, P(90 < X̄ < 100) = P(−2.25 < z < 1.5) = 0.4878 + 0.4332 = 0.9210.

[Figure: standard normal curve shaded between z = −2.25 and z = 1.5]

Therefore, the probability that the mean of their age is between 90 and 100 months is 0.921.



7.5 Summary
This chapter explained the normal distribution. The concepts discussed in this chapter were the normal
distribution and standard normal distribution with their properties; how to find the area under standard
normal curves; how to find z values given the area under standard normal curves and the applications of
normal distribution. The chapter also discussed the sampling distribution of sample mean using the
central limit theorem.

EXERCISES

1. Find the probabilities for each (where Z is a standard normal variable):


A. P(0.21  Z  1.57)
B. P(Z  1.43)

2. Find the value of Z0 (where Z is a standard normal variable) in the following:


A. P(Z  Z0 )  0.1234
B. P(1.2  Z  Z0 )  0.8671

3. In a test, the average score of the 385 students is 65, with a standard deviation of 10. Assume the
scores are normally distributed:
A. What percentage of students scored between 47 and 67?
B. What percentage of students scored 86 or greater?

4. Daily sales of petrol from the Nabua service station are normally distributed with mean 6300L and
the standard deviation 400L.
A. If a daily sale is selected at random, find the probability that it is less than 6200L.
B. If petrol sales are sampled for 40 days, and the mean is calculated, find the probability that the
sample mean is less than 6200L.

5. The marks in the ST130 exam are known to be normally distributed, with a mean of 51 and a standard
deviation of 14. If 200 students take the test,
A. How many would you expect to score between 58 and 65?
B. If 5% of the students get an A+, what is the minimum mark for an A+?

6. The weights of boxes of oranges are normally distributed such that 30% of them are greater than 4kg
and 20% are greater than 4.53kg. Estimate the mean and the standard deviation of the weights.

7. It is reported that children between 2 and 5 years old watch an average of 25 hours of TV per week.
Assume the variable is normally distributed and the standard deviation is 3 hours. If 20 children
between the ages of 2 and 5 are randomly selected, find the probability that the mean of the number
of hours they watch TV will be greater than 26.3 hours.



CHAPTER 8:

CONFIDENCE INTERVALS AND


SAMPLE SIZE

Chapter 8: Confidence Intervals and Sample Size 122


Overview
This chapter explains how to construct confidence intervals and determine the minimum sample size. The
concepts discussed in this chapter are as follows: confidence interval for population mean and population
proportion; minimum sample size needed in population mean and population proportion estimation. The
chapter concludes with a summary and a set of exercises.

Objectives
After completing this chapter, you should be able to:
1. Find the confidence interval for the mean when the population standard deviation is known.
2. Determine the minimum sample size for finding a confidence interval of mean.
3. Find the confidence interval for the mean when the population standard deviation is unknown.
4. Find the confidence interval for the population proportion.
5. Determine the minimum sample size for finding a confidence interval of proportion.

8.1 Introduction
As part of inferential statistics, we need to determine the values of population parameters. This is usually
not possible since the population is large, so statisticians have to estimate the value of the parameter. An
important aspect of inferential statistics is estimation, which is the process of estimating the true value
of a population parameter from the information derived from a small sample. For instance, the population
mean (µ) can be estimated using the sample mean (X̄).

Therefore, in this chapter, we will explain statistical procedures for estimating the population mean and
proportion. Another important question in estimation is that of sample size. How large a sample should
be drawn in order to make an accurate estimate? This question is not easy to answer as it depends on
several factors, such as the accuracy desired and the probability of making a correct estimate. The
problem of determining the sample size for estimating the parameters will also be discussed in this
chapter.

8.2 Estimation
Estimation is the process of estimating the true value(s) of a population parameter from the information
derived from a small sample.

Consider the following statements:


• 25% of Americans are currently dieting.
• The average good IT graduate makes $32,786 a year.
These values are only estimates of the true parameters and are derived from data collected from
samples.

8.2.1 Properties of a Good Estimator


A good estimator has the following three properties:
A. Unbiased. It should be unbiased. That is, the expected value or the mean of the estimates obtained
from samples of a given size is equal to the parameter being estimated. For example, the sample
mean is an unbiased estimator of the population mean since µ_X̄ = µ.
B. Consistent. It should be consistent: as the sample size increases, the value of the estimator approaches
the value of the parameter estimated.
C. Relatively efficient. Of all the statistics that can be used to estimate a parameter, the relatively efficient
estimator has the smallest variance.

8.2.2 Types of Estimates


There are two types of estimate:
i. A Point Estimate is a specific numerical value estimate of a parameter.
ii. An Interval Estimate is an interval or range of values used to estimate a parameter.

Confidence Interval
A confidence interval is a specific interval estimate of a parameter determined by using the data obtained
from sample and a specific confidence level.
Confidence Level
A Confidence Level of an interval estimate of a parameter is the probability that the interval estimate will
contain the parameter.

EXAMPLE 8−1

Suppose that a 90% confidence interval states that the population mean is greater than 100 and less
than 200. How would you interpret this statement?

SOLUTION

It means that we are 90% confident that the interval contains the true population mean.

8.3 Confidence Intervals and Sample Size for the Mean when σ is
known
Before constructing the confidence interval for µ, it is essential to know the following:
• Is the distribution of the population normal or not?
• Is the population standard deviation known or unknown?
• Is the sample size large or small?
Our answers will then determine how to proceed. In this section we are going to construct the confidence
interval of the population mean when σ is known.

8.3.1 Formula for the Confidence Interval


The confidence interval for µ when σ is known is given by:

$$\bar{X} - z_{\alpha/2}\,\frac{\sigma}{\sqrt{n}} < \mu < \bar{X} + z_{\alpha/2}\,\frac{\sigma}{\sqrt{n}},$$

where z_{α/2} is the value of z that leaves an area of α/2 to its right, as shown in the following
figure.

Note:
1. If n < 30 the population should be normally distributed.
2. The values of z_{α/2} for some confidence levels are as follows:
• For the 99% confidence interval, z_{α/2} = 2.58.
• For the 95% confidence interval, z_{α/2} = 1.96.
• For the 90% confidence interval, z_{α/2} = 1.65.

However, other confidence levels could be given, so how do we find the value of z_{α/2}? Let us
consider the next example.

EXAMPLE 8−2

Find the value of z_{α/2} for a 98% confidence interval of the mean.

SOLUTION

Draw a standard normal curve and shade the area 0.98 in the middle. See the graph below.

[Figure: standard normal curve with the middle area of 0.98 shaded; z_{α/2} marked to the right of 0]

The area between 0 and z_{α/2} is 0.98 / 2 = 0.49. Use the standard normal table from the Eton tables to
find the value of z_{α/2}: look up 0.49 in the probability section and read the corresponding z value.
Therefore z_{α/2} = 2.33.

Note: The value of α = 1 − 0.98 = 0.02 and α/2 = 0.02/2 = 0.01. Therefore, the area to the right of
z_{α/2} is 0.01.
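
For confidence levels that are not listed, the lookup can also be done numerically. The following is a minimal sketch, assuming Python 3.8 or later and the standard-library `statistics.NormalDist`; the function name `z_critical` is hypothetical. It returns unrounded quantiles, so the values agree with the table entries above only after rounding to two decimal places.

```python
from statistics import NormalDist

def z_critical(confidence):
    """Return z_{alpha/2}, the z value that leaves alpha/2 in the right tail."""
    alpha = 1 - confidence
    return NormalDist().inv_cdf(1 - alpha / 2)

for level in (0.90, 0.95, 0.98, 0.99):
    print(level, round(z_critical(level), 3))
```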



EXAMPLE 8−3

A random sample of 49 shoppers showed that they spend an average of $23.45 per visit at MHCC
Bookstore. From past studies, it is known that σ = $2.80.
A. Find a point estimate of the population mean.
B. Find the 99% confidence interval of the true mean.

SOLUTION

A. A point estimate of the population mean is X̄ = $23.45.


B. For the 99% confidence interval, z_{α/2} = 2.58. We have σ = $2.80, X̄ = $23.45 and n = 49,
so the 99% confidence interval for µ is

$$23.45 - 2.58\,\frac{2.8}{\sqrt{49}} < \mu < 23.45 + 2.58\,\frac{2.8}{\sqrt{49}}$$

$$22.42 < \mu < 24.48.$$

Hence one can say with 99% confidence that the average spending per visit at MHCC Bookstore is
between $22.42 and $24.48, based on a sample of 49 customers.

EXAMPLE 8−4

Suppose a registrar of the University of the South Pacific (USP) wishes to estimate the average number
of hours per day of distractions (phone calls, emails, impromptu visits, etc.) experienced by USP lecturers.
A study of random sample of 50 lecturers in USP found that the average distraction time is 1.8 hours per
day and the population standard deviation was 20 minutes. Estimate the true mean population distraction
time for USP lecturers with 90% confidence.

SOLUTION

For the 90% confidence interval, z_{α/2} = 1.65. The standard deviation of 20 minutes is 20/60 ≈ 0.33 hours,
so we have σ = 0.33, X̄ = 1.8 and n = 50. The 90% confidence interval for µ is

$$1.8 - 1.65\,\frac{0.33}{\sqrt{50}} < \mu < 1.8 + 1.65\,\frac{0.33}{\sqrt{50}}$$

$$1.72 < \mu < 1.88.$$

Hence one can say with 90% confidence that the average distraction time for a USP lecturer is between
1.72 and 1.88 hours per day, based on 50 lecturers.
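
The interval formula for a known σ can be wrapped in a small helper. The sketch below is an illustration only, assuming Python 3; the function name `z_interval` is hypothetical and the z values are the rounded table values used in the text.

```python
from math import sqrt

def z_interval(xbar, sigma, n, z):
    """Confidence interval for the mean when sigma is known."""
    margin = z * sigma / sqrt(n)      # margin of error
    return xbar - margin, xbar + margin

# Example 8-3: 99% interval, z = 2.58
low_a, high_a = z_interval(23.45, 2.80, 49, 2.58)
print(round(low_a, 2), round(high_a, 2))

# Example 8-4: 90% interval, z = 1.65, sigma = 20 minutes expressed in hours
low_b, high_b = z_interval(1.8, 20 / 60, 50, 1.65)
print(round(low_b, 2), round(high_b, 2))
```

Keeping the sample mean and the standard deviation in the same unit (hours in Example 8−4) is the step most easily overlooked.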

Sample Size
Quite often, researchers need to know how large a sample is necessary to make an accurate estimate.
One may ask why sample size is so important. The answer is that an appropriate sample size is
required for validity: if the sample is too small, the results will be questionable, while a sample
that is too large wastes money and time.



8.3.2 Formula for Minimum Sample Size
The formula for the minimum sample size needed for an interval estimate of the population mean is:

$$n = \left(\frac{z_{\alpha/2}\,\sigma}{E}\right)^2,$$

where E is called the margin of error. If the result is not a whole number, round it up.

EXAMPLE 8−5

A pizza shop owner wishes to find the 95% confidence Interval of the true mean cost of a large plain
pizza. How large should the sample be if she wishes to be accurate to within $0.15? A previous study
showed that the population standard deviation of the price was $0.26.

SOLUTION

For the 95% confidence interval, z_{α/2} = 1.96. Here σ = 0.26 and E = 0.15, hence

$$n = \left(\frac{z_{\alpha/2}\,\sigma}{E}\right)^2 = \left(\frac{(1.96)(0.26)}{0.15}\right)^2 = 11.5.$$

Therefore, the minimum sample size should be 12 to estimate the population mean with 95% confidence.

EXAMPLE 8−6

A researcher in Fiji wishes to estimate within $300 the true average amount of money Fiji spends on road
repairs each year. The standard deviation is known to be $900. If she wants to be 90% confident, how
large a sample is necessary?

SOLUTION

For the 90% confidence interval, z_{α/2} = 1.65. Here σ = 900 and E = 300, hence

$$n = \left(\frac{(1.65)(900)}{300}\right)^2 = 24.5.$$

Therefore, the minimum sample size should be 25 to estimate the population mean with 90% confidence.
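
The rounding-up step in the two examples above can be made explicit in code. This is a minimal sketch assuming Python 3; the function name `min_sample_size_mean` is hypothetical, and the z values are the rounded table values from the text.

```python
from math import ceil

def min_sample_size_mean(z, sigma, E):
    """Smallest whole n for which z * sigma / sqrt(n) does not exceed E."""
    return ceil((z * sigma / E) ** 2)

print(min_sample_size_mean(1.96, 0.26, 0.15))   # Example 8-5
print(min_sample_size_mean(1.65, 900, 300))     # Example 8-6
```

The result is always rounded up, never to the nearest integer, because rounding down would give a margin of error larger than E.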



Confidence Intervals for the Mean when σ is unknown
Recall from Section 8.3 that when σ is known, we use the z-distribution to find the confidence interval of
the population mean. If n < 30, then the population should be normally distributed.

However, when σ is unknown, a t-distribution is used. If n < 30, then the population should be normally
distributed.

If you are still unsure whether to use the z or the t distribution, see the diagram below.

Is σ known?
    Yes: Use z_{α/2} values and σ in the formula. *
    No:  Use t_{α/2} values and s in the formula. *

* If n < 30, the variable must be normally distributed.

8.4 Characteristics of the t-distribution


The t-distribution is similar to the standard normal distribution in the following ways:
• It is bell-shaped.
• It is symmetrical about the mean.
• The mean, median, and mode are equal to 0 and are located at the center of the distribution.
• The curve never touches the x axis.

However, it is different in the following ways:

• The variance is greater than 1.
• The t-distribution is actually based on the concept of degrees of freedom, denoted as d.f.,
which equals the sample size minus one.
• As the sample size increases, the t-distribution approaches the standard normal distribution.

8.4.1 Formula for the Confidence Interval


The confidence interval for µ when σ is unknown is given by:

$$\bar{X} - t_{\alpha/2}\,\frac{s}{\sqrt{n}} < \mu < \bar{X} + t_{\alpha/2}\,\frac{s}{\sqrt{n}}.$$

The values for t_{α/2} can be found from the t-distribution table from the Eton Tables.



EXAMPLE 8−7

Find the t_{α/2} value for a 90% confidence interval of the population mean, when the sample size is 20.

SOLUTION

For the 90% confidence interval, α = 0.10, thus α/2 = 0.05. Since n = 20, d.f. = 20 − 1 = 19, so look up the
t-distribution table with ν = 19, 2p = 0.1 and p = 0.05 to get t_{α/2} = 1.729.

Note: In the table:

• ν is the degrees of freedom (d.f.).
• 2p is the area in both tails and is equal to α.
• p is the area in one tail and is equal to α/2.

The t-distribution

2p      0.2      0.1      …    0.001
p       0.1      0.05     …    0.0005
ν
1       3.078    6.314    …    636.62
2       …
⋮
19      1.328    1.729    …    3.883
⋮
∞       1.282    1.645    …    3.291

EXAMPLE 8−8

For a group of 20 ST130 students subjected to a stress situation, the mean number of heart beats per
minute was 126, and the standard deviation was 4. Find the 95% confidence interval of the true mean.
Assume the variable is normally distributed.

SOLUTION

Since the population standard deviation σ is unknown (the value 4 is the sample standard deviation s), we
use the t-distribution. For the 95% confidence interval, α = 0.05 and α/2 = 0.025, and d.f. = 20 − 1 = 19,
so look up the t-distribution table from the Eton tables with ν = 19, 2p = 0.05 and p = 0.025 to get
t_{α/2} = 2.093. Now the 95% confidence interval is:

$$126 - 2.093\,\frac{4}{\sqrt{20}} < \mu < 126 + 2.093\,\frac{4}{\sqrt{20}}$$

$$124.13 < \mu < 127.87.$$



EXAMPLE 8−9

A sample of 10 observations taken from a normal population produced the following data.

44 52 31 48 46 39 47 36 41 56

A. Find the point estimate of the population mean.


B. Find the 95% confidence interval for the population mean.

SOLUTION

From the data, we know that X̄ = 44 and s = 7.5.

A. The point estimate of µ is X̄ = 44.

B. Since σ is unknown, we use the t-distribution. Look up the t-distribution table with ν = 9,
2p = 0.05 and p = 0.025 to get t_{α/2} = 2.262. Hence the 95% confidence interval is:

$$44 - 2.262\,\frac{7.5}{\sqrt{10}} < \mu < 44 + 2.262\,\frac{7.5}{\sqrt{10}}$$

$$38.64 < \mu < 49.36.$$
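
The summary statistics and the interval in this example can be reproduced from the raw data. The following is a minimal sketch, assuming Python 3 and its standard-library `statistics` module; the variable names are illustrative. The t value is entered by hand from the table because the standard library has no t-distribution. The sketch uses the unrounded value of s, so the limits can differ from those above by about 0.01.

```python
from math import sqrt
from statistics import mean, stdev

data = [44, 52, 31, 48, 46, 39, 47, 36, 41, 56]
n = len(data)

xbar = mean(data)     # point estimate of the population mean
s = stdev(data)       # sample standard deviation (divides by n - 1)

t = 2.262             # table value for d.f. = 9 and p = 0.025, entered by hand

margin = t * s / sqrt(n)
low, high = xbar - margin, xbar + margin

print(xbar, round(s, 2))
print(round(low, 2), round(high, 2))
```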

8.5 Confidence Intervals and Sample Size for Proportion


We have discussed the confidence interval for a population mean. Now we will find the confidence interval
for another population parameter, called a proportion. The procedure for finding the confidence interval and
the sample size for a proportion is similar to that for the population mean.

The population proportion, denoted by p, is the proportion of population units that possess a
characteristic. The population proportion is given by:

$$p = \frac{X}{N},$$

where
X is the number of population units that possess a characteristic,
N is the population size, and
q = 1 − p is the proportion of population units that do not possess a characteristic.

For example, in the USP assessment meeting, the ST130 lecturer stated that 75% of ST130 students
passed the course last semester. The parameter 75% is a population proportion.

The population proportion, p, is often unknown, so a sample proportion, denoted as p̂ (read p hat), is
used to estimate it. It represents the proportion of sample units that possess a characteristic. The sample
proportion is given by:

$$\hat{p} = \frac{x}{n},$$

where
x is the number of sample units that possess a characteristic,
n is the sample size, and
q̂ = 1 − p̂ is the proportion of sample units that do not possess a characteristic.

EXAMPLE 8−10

In a study, 400 students were asked whether they own a computer; 352 said that they did. Find
p̂ and q̂.

SOLUTION

Here n = 400 and x = 352, so

$$\hat{p} = \frac{352}{400} = 0.88 \quad \text{and} \quad \hat{q} = 1 - 0.88 = 0.12.$$

We can say that for this sample 88% of students surveyed owned a computer.

8.5.1 Sampling Distribution of Sample Proportion


1. The mean of the sample proportion p̂, denoted by µ_p̂, is equal to the population proportion p.
That is, µ_p̂ = p.

2. The standard deviation of the sample proportion p̂ is denoted by σ_p̂ and is given by

$$\sigma_{\hat{p}} = \sqrt{\frac{pq}{n}}.$$

This formula is used when n/N ≤ 0.05.
3. By the central limit theorem, the sampling distribution of p̂ is approximately normal for a sufficiently
large sample size (that is, np > 5 and nq > 5) with a mean of p and a standard deviation of √(pq/n).
4. Therefore, the z value of p̂ is given by

$$z = \frac{\hat{p} - p}{\sqrt{pq/n}}.$$



8.5.2 Confidence Interval for Proportion
The confidence interval for a population proportion is given by:

$$\hat{p} - z_{\alpha/2}\sqrt{\frac{\hat{p}\hat{q}}{n}} < p < \hat{p} + z_{\alpha/2}\sqrt{\frac{\hat{p}\hat{q}}{n}}.$$

EXAMPLE 8−11

A recent study of 100 people in Fiji found 27 were obese. Find the 95% confidence interval of the population
proportion of all individuals living in Fiji who are obese.

SOLUTION

For the 95% confidence interval, z_{α/2} = 1.96. We have n = 100, p̂ = 27/100 = 0.27, and
q̂ = 1 − 0.27 = 0.73, so the 95% confidence interval for p is:

$$0.27 - 1.96\sqrt{\frac{(0.27)(0.73)}{100}} < p < 0.27 + 1.96\sqrt{\frac{(0.27)(0.73)}{100}}$$

$$0.183 < p < 0.357.$$

Hence, one can be 95% confident that the proportion of people in Fiji who are obese is between 18.3% and 35.7%.

EXAMPLE 8−12

A survey of 120 female freshmen showed that 18 did not wish to work after marriage. Find the 90%
confidence interval of the true proportion of females who do not wish to work after marriage.

SOLUTION

For the 90% confidence interval, z_{α/2} = 1.65. We have n = 120, p̂ = 18/120 = 0.15, and q̂ = 0.85, so
the 90% confidence interval for p is:

$$0.15 - 1.65\sqrt{\frac{(0.15)(0.85)}{120}} < p < 0.15 + 1.65\sqrt{\frac{(0.15)(0.85)}{120}}$$

$$0.096 < p < 0.204.$$

Hence, we can say with 90% confidence that between 9.6% and 20.4% of females do not wish to work after
marriage.
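
The proportion interval follows the same pattern as the interval for the mean, with √(p̂q̂/n) in place of σ/√n. A minimal sketch follows, assuming Python 3; the function name `proportion_interval` is hypothetical and the z values are the rounded table values used in the text.

```python
from math import sqrt

def proportion_interval(x, n, z):
    """Confidence interval for a population proportion from x successes in n trials."""
    p_hat = x / n
    q_hat = 1 - p_hat
    margin = z * sqrt(p_hat * q_hat / n)    # margin of error
    return p_hat - margin, p_hat + margin

# Example 8-11: 27 of 100, 95% confidence (z = 1.96)
low_a, high_a = proportion_interval(27, 100, 1.96)
print(round(low_a, 3), round(high_a, 3))

# Example 8-12: 18 of 120, 90% confidence (z = 1.65)
low_b, high_b = proportion_interval(18, 120, 1.65)
print(round(low_b, 3), round(high_b, 3))
```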



8.5.3 Formula for Minimum Sample Size
The minimum sample size needed for an interval estimate of a population proportion is:

$$n = \hat{p}\hat{q}\left(\frac{z_{\alpha/2}}{E}\right)^2,$$

where E is called the margin of error.

EXAMPLE 8−13

It is believed that 10% of Suva homes have a direct satellite television receiver (SKY Pacific). How large
a sample is necessary to estimate the true proportion of homes that do, with 90% confidence and within
3 percentage points?

SOLUTION

For the 90% confidence interval, z_{α/2} = 1.65. Here p̂ = 0.1, q̂ = 0.9 and E = 0.03, hence

$$n = \hat{p}\hat{q}\left(\frac{z_{\alpha/2}}{E}\right)^2 = (0.1)(0.9)\left(\frac{1.65}{0.03}\right)^2 = 272.25.$$

Thus, a minimum sample size of 273 is required.

EXAMPLE 8−14

A researcher wishes to estimate the proportion of executives who own a car phone. She wants to be 99%
confident and be accurate within 5% of the true proportion. Find the minimum sample size necessary.

SOLUTION

For the 99% confidence interval, z_{α/2} = 2.58. In this problem, we have no prior knowledge of p̂ and so we
assign p̂ = 0.5 and therefore q̂ = 0.5. Hence,

$$n = (0.5)(0.5)\left(\frac{2.58}{0.05}\right)^2 = 665.64.$$

Thus, the researcher needs to interview at least 666 executives.
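
Both sample-size examples for a proportion can be reproduced with one helper. The sketch below is an illustration only, assuming Python 3; the function name `min_sample_size_proportion` and its default of p̂ = 0.5 (the choice made above when no prior estimate is available) are assumptions of this sketch.

```python
from math import ceil

def min_sample_size_proportion(z, E, p_hat=0.5):
    """Smallest whole n for a proportion estimate with margin of error at most E."""
    q_hat = 1 - p_hat
    return ceil(p_hat * q_hat * (z / E) ** 2)

print(min_sample_size_proportion(1.65, 0.03, p_hat=0.1))   # Example 8-13
print(min_sample_size_proportion(2.58, 0.05))              # Example 8-14
```

Using p̂ = 0.5 gives the largest possible value of p̂q̂ (0.25), so the resulting sample size is large enough whatever the true proportion turns out to be.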



8.6 Summary
This chapter explains how to construct confidence intervals and determine the minimum sample size. The
concepts discussed in this chapter are as follows: confidence interval for population mean and population
proportion, minimum sample size needed in population mean and population proportion estimation.

EXERCISES

1. Explain the terms confidence level and confidence interval.

2. A recent survey of 8 social networking sites has a mean of 13.1 and a standard deviation of 4.1
million visitors for a specific month. Find the 95% confidence interval of the true mean. Assume that
the variable is normally distributed.

3. If the variance of a national accounting exam is 900, how large a sample is needed to estimate the
true mean score within 5 points and with 99% confidence?

4. The number of unhealthy days based on the AQI (air quality index) for a random sample of
metropolitan areas is shown:

61 12 6 40 27 38 93 5 13 40

A. What is the point estimate of the mean number of unhealthy days for all such areas?
B. Construct a 98% confidence interval of the true mean based on the data.

5. A sample of 30 networking sites for a specific month has a mean of 26.1. Assume the population
standard deviation to be 4.2. Find the 99% confidence interval of the true mean.

6. A recent study indicated that 29% of the 100 women over age 55 in the study were widows. How
large a sample must you take to be 90% confident that the estimate is within 0.05 of the true
proportion of women over age 55 who are widows?

7. A Tongan advertising agency wishes to estimate the proportion of households that use a particular
brand of washing soap. They decide on a sample size of 500 and find that 157 households use the
product.
A. Construct a 99% confidence interval for the proportion.
B. How large would the sample have to be for their interval estimate of the proportion to be in
error by at most 2%?

8. In a survey of drug use among 995 Suva teenagers, the following results were reported. Estimate
with 90% confidence the proportion of all Suva teenagers who are daily smokers or occasional
smokers.

Source Daily smokers Occasional smokers Ex-smokers Never smoked


Percentage (%) 21.7 7.4 31.2 39.7



CHAPTER 9:

HYPOTHESIS TESTING (PART I)

Chapter 9: Hypothesis Testing (Part I) 135


Overview
This chapter introduces the concept of hypothesis testing. The concepts discussed in this chapter are as
follows: statistical hypothesis; null and alternate hypothesis; statistical test; type I and type II error; level
of significance; critical and non-critical region; z−test for mean; methods of hypothesis testing. The
chapter concludes with a summary and a set of exercises.

Objectives
After completing this chapter, you should be able to:
1. State the null and alternate hypothesis.
2. Test means when population standard deviation is known, using the z-test.
3. Test hypothesis using a p-value method.

9.1 Introduction
Researchers are interested in answering many types of questions. For example, a scientist might want
to know whether the earth is warming up. A physician might want to know whether a new medication will
lower a person’s blood pressure. An educator might wish to see whether the new method of teaching is
better than a traditional one. Automobile manufacturers are interested in determining whether seat belts
reduce the severity of injuries caused in accidents. These types of questions can be addressed through
statistical hypothesis testing, which is a decision-making process for evaluating claims about a
population parameter.

In this chapter, we will discuss the basic concepts of hypothesis testing. We will also discuss the
hypothesis testing procedure for the population mean using a z-test and the different methods of hypothesis
testing: the traditional method, the P-value method and the confidence interval method.

9.2 Concepts of Hypothesis Testing


In a test of hypothesis, we test a certain claim about a population parameter. We may want to find out,
using some sample information available, whether or not a given claim (or statement) about a population
parameter is valid. For example, a company called Bula soft–drinks ‘claims’ that on average, its bottle
contains 350 ml of soda. A government agency may want to test whether or not such bottles contain, on
average, 350 ml of soda. To test this, a sample of 100 bottles was investigated and it was found that the
mean amount of soda in these bottles was 347 ml. Based on this result, can we state that, on average,
all such bottles contain less than 350 ml of soda and that the company is lying to the public? Not until we
perform a test of hypothesis can we make such a claim.

9.2.1 Statistical Hypothesis


Every hypothesis-testing situation begins with the statement of a hypothesis. A statistical hypothesis is
an assumption about a population parameter. This assumption may or may not be true.

9.2.2 Type of Statistical Hypothesis


There are two types of statistical hypothesis for a situation:
1. Null hypothesis and
2. Alternative hypothesis.



1. Null hypothesis
The null hypothesis, denoted by H0, is a statistical hypothesis that states that there is no difference
between a parameter and a specific value, or that there is no difference between two parameters.

2. Alternative hypothesis
The alternative hypothesis, denoted by H1, is a statistical hypothesis that states the existence of a
difference between a parameter and a specific value, or states that there is a difference between two
parameters.

How to write hypotheses?


To illustrate how hypotheses should be stated, we will look at three different cases.

CASE 9−1

A medical researcher is interested in finding out whether a new medication will have any undesirable side
effects. The researcher is particularly concerned with the pulse rate of the patients who take the
medication. Will the pulse rate increase, decrease, or remain unchanged after a patient takes the
medication? In the past, the mean pulse rate has been 82 beats per minute.

In this case, the researcher wants to study whether the mean pulse rate μ = 82 or not. Therefore, the
hypotheses for this situation are:
H0: μ = 82
H1: μ ≠ 82

This test is called a two-tailed test because there is a not equal sign in the alternate hypothesis.

Note: While writing the hypotheses, you should remember the following:
1. If you are testing μ = μ0 or μ ≤ μ0 or μ ≥ μ0, it must go to the null hypothesis H0, and its
complement, i.e. μ ≠ μ0 or μ > μ0 or μ < μ0, respectively, will go to H1.

2. If you are testing μ ≠ μ0 or μ > μ0 or μ < μ0, it must go to H1, and its complement, i.e.
μ = μ0 or μ ≤ μ0 or μ ≥ μ0, respectively, will go to H0.

3. Always write H0 as H0: μ = μ0, even when your test is μ ≤ μ0 or μ ≥ μ0.

CASE 9−2

A chemist invents an additive to increase the life of an automobile battery. The mean lifetime of the
automobile battery is 36 months. In this case you are testing μ > 36, so it goes to H1, and its
complement, i.e. μ = 36 or μ ≤ 36, will go to H0. Therefore, the hypotheses for this situation are:
H0: μ = 36
H1: μ > 36

This test is called right-tailed because there is a greater than sign in the alternate hypothesis.



CASE 9−3

A contractor wishes to lower heating bills by using a special type of insulation in houses. If the average
monthly heating bill is $78, her hypotheses about heating costs with the use of insulation are:

H0: μ = $78
H1: μ < $78

This test is a left-tailed test because there is a less than sign in the alternate hypothesis.

9.2.3 Statistical Test


After stating the hypothesis, the researcher selects the correct statistical test, chooses an appropriate
level of significance and formulates a plan for conducting the study.
• A statistical test uses the data obtained from a sample to make a decision about whether the
null hypothesis should be rejected.
• The numerical value obtained from a statistical test is called the test statistic value.

Two types of Error


The test of hypothesis enables us to decide whether a null hypothesis is to be accepted or rejected. Since
the decision is made on the basis of the information obtained from sample observation, there is a chance
of making an error. While making a decision, the following four situations can arise.

                          Decision on H0
Nature of H0       Reject H0                          Accept H0
H0 is true         Wrong decision (Type I error)      Correct decision
H0 is false        Correct decision                   Wrong decision (Type II error)

The above table shows that we are liable to commit the following two types of errors:
i. Rejection of the null hypothesis ( H 0 ) when it is true, is called a type I error.

ii. Acceptance of the null hypothesis ( H 0 ) when it is false, is called a type II error.
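
How often a type I error occurs can be illustrated with a small simulation. The sketch below is only an illustration under stated assumptions: it builds a hypothetical population in which the null hypothesis is true (mean 350 ml, as in the bottle example; the standard deviation of 5 ml, the sample size of 30 and the function name are assumptions, not values from the text), draws many samples, and counts how often a two-tailed z-test designed to allow a 5% chance of a type I error rejects the true null hypothesis.

```python
import random
from statistics import NormalDist, fmean


def type_i_error_rate(mu0=350.0, sigma=5.0, n=30, alpha=0.05,
                      trials=5000, seed=1):
    """Estimate how often a two-tailed z-test rejects a TRUE null hypothesis."""
    rng = random.Random(seed)                     # fixed seed so the run is repeatable
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 when alpha = 0.05
    rejections = 0
    for _ in range(trials):
        # Draw a sample from a population in which H0 (mu = mu0) really holds.
        sample = [rng.gauss(mu0, sigma) for _ in range(n)]
        z = (fmean(sample) - mu0) / (sigma / n ** 0.5)
        if abs(z) > z_crit:                       # test value lands in the rejection region
            rejections += 1                       # H0 is true but rejected: a type I error
    return rejections / trials


rate = type_i_error_rate()
print(f"Estimated type I error rate: {rate:.3f}")
```

The printed rate should be close to 0.05, the chance of a type I error that the test was set up to allow.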



9.2.4 Level of Significance
The probability of committing a type I error (the probability of rejecting H0 when it is true) is called the level of
significance and is denoted by α. That is,

α = P(Reject H0 | H0 is true).

Note: The probability of committing a type II error is denoted by β. That is,

β = P(Accept H0 | H0 is false).

9.2.5 Critical Region, Acceptance Region and Critical Value


After the level of significance is chosen, a critical value is selected from the tables for the appropriate test.
If a z-test is used, for example, the critical value is looked up in the z-table.
• The range of possible test values is divided into two regions, the critical region and the
acceptance region.
• The critical (or rejection) region is the range of test values for which the null hypothesis (H0)
is rejected.
• The acceptance (or noncritical) region is the range of test values for which the null hypothesis (H0)
is accepted.
• The value that separates the rejection region from the acceptance region is called the critical
value.

The critical value, critical region and noncritical region of a two-tailed test, a right-tailed test and a
left-tailed test are shown in the following figures.

[Figures: standard normal curves showing the critical value(s), critical region(s) and noncritical region for a two-tailed test, a right-tailed test and a left-tailed test. Source: Bluman (2012)]
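
Critical values read from a z-table can also be computed. The following is a minimal sketch using `NormalDist` from Python's standard library; the function name `critical_values` and the tail labels `'two'`, `'left'` and `'right'` are illustrative choices, not notation from the text.

```python
from statistics import NormalDist


def critical_values(alpha, tail):
    """Return the z critical value(s) for a 'two', 'left' or 'right' tailed test."""
    std_normal = NormalDist()                 # mean 0, standard deviation 1
    if tail == "two":
        z = std_normal.inv_cdf(1 - alpha / 2)  # alpha is split between the two tails
        return (-z, z)
    if tail == "left":
        return (std_normal.inv_cdf(alpha),)    # area alpha lies to the left
    if tail == "right":
        return (std_normal.inv_cdf(1 - alpha),)  # area alpha lies to the right
    raise ValueError("tail must be 'two', 'left' or 'right'")


print([round(z, 2) for z in critical_values(0.05, "two")])
print([round(z, 3) for z in critical_values(0.05, "right")])
print([round(z, 2) for z in critical_values(0.01, "left")])
```

Computed values can differ slightly from printed tables because tables are rounded: the right-tailed value for α = 0.05 is about 1.645, which tables commonly report as 1.65.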

9.3 z-test for Mean


In this section, we will learn how to test hypotheses concerning the population mean. The z-test is a statistical
test for the population mean when σ is known, for any sample size. (If n < 30, the variable must be
normally distributed.)

9.3.1 Test Statistic


The value of the test statistic is obtained by:

$$z = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}}, \quad \text{when } \sigma \text{ is known.}$$

The test is carried out in the following steps:
Step 1: State the hypothesis.
Step 2: Find the critical value(s).
Step 3: Compute the test statistic value.
Step 4: Make a decision to reject or not reject the null hypothesis.
Step 5: Summarize the results.

EXAMPLE 9−1

A survey claims that the average cost of a hotel room in Fiji is $69.21. To test the claim, a researcher selects
a sample of 30 hotel rooms and finds that the average cost is $68.43. The standard deviation of the
population is $3.72. At α = 0.05, is there enough evidence to reject the claim?
SOLUTION

We need to test whether μ = $69.21 (claim), which should be stated in the null hypothesis.

Step 1: State the hypothesis


H0: μ = $69.21 (claim)
H1: μ ≠ $69.21

Step 2: Find the critical value.
Since α = 0.05 and the test is two-tailed, find α/2 = 0.05/2 = 0.025. So the areas in the left tail and the
right tail are each 0.025. Draw a standard normal curve and find the z-values using the Eton tables. The
critical values are z = ±1.96. See the diagram below.

[Figure: standard normal curve with critical regions to the left of z = −1.96 and to the right of z = 1.96, and the acceptance region between them]

Step 3: Compute the test statistic value.

Given: n = 30, X̄ = 68.43, μ = 69.21 and σ = 3.72.

Therefore,

$$z = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}} = \frac{68.43 - 69.21}{3.72 / \sqrt{30}} = -1.15$$

Step 4: Make a decision.
Since the test value, z = −1.15, falls in the acceptance region, the decision is: “Do not reject H0”.
Step 5: Summarize the results.
There is not enough evidence to reject the claim that the average cost of a hotel room in Fiji is $69.21.
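
The traditional method for a z-test of one mean can be sketched in a few lines of Python. This is an illustration rather than part of the original text: the function name `z_test_traditional` and the tail labels are assumptions, and the critical values are computed with the standard library's `NormalDist` instead of being read from the Eton tables. The call at the end uses the numbers of Example 9−1.

```python
from math import sqrt
from statistics import NormalDist


def z_test_traditional(x_bar, mu0, sigma, n, alpha, tail):
    """Traditional (critical value) z-test for one mean when sigma is known.

    tail is 'two', 'left' or 'right', matching the sign used in H1.
    Returns the test value and True if H0 is rejected.
    """
    z = (x_bar - mu0) / (sigma / sqrt(n))         # Step 3: test statistic value
    std_normal = NormalDist()
    if tail == "two":                             # Step 2: critical value(s)
        reject = abs(z) > std_normal.inv_cdf(1 - alpha / 2)
    elif tail == "left":
        reject = z < std_normal.inv_cdf(alpha)
    elif tail == "right":
        reject = z > std_normal.inv_cdf(1 - alpha)
    else:
        raise ValueError("tail must be 'two', 'left' or 'right'")
    return z, reject                              # Step 4: decision


z, reject = z_test_traditional(68.43, 69.21, 3.72, 30, 0.05, "two")
print(round(z, 2), "reject H0" if reject else "do not reject H0")
```

The same function handles one-tailed tests by passing `"left"` or `"right"` as the last argument.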

EXAMPLE 9−2

A researcher reports that the average salary of assistant professors (AP) is more than $42,000. A
sample of 30 APs has a mean salary of $43,260. At α = 0.05, test the claim that APs earn more than
$42,000/yr. It is known that σ = $5,230.

SOLUTION

We need to test here μ > $42,000 (claim), which should be stated in the alternative hypothesis.

Step 1: State the hypothesis


H0: μ = $42,000
H1: μ > $42,000 (claim)

Step 2: Find the critical value.
Since α = 0.05 and the test is right-tailed, the area in the right tail is 0.05. Draw a standard normal
curve and find the z-value using the Eton tables. The critical value is z = 1.65. See the diagram below.

[Figure: standard normal curve with the critical region to the right of z = 1.65 and the acceptance region to its left]

Step 3: Compute the test statistic value


Given: n = 30, X̄ = 43260, μ = 42000 and σ = 5230.
Therefore,

$$z = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}} = \frac{43260 - 42000}{5230 / \sqrt{30}} = 1.32$$

Step 4: Make a decision


Since the test value, z = 1.32 falls in acceptance region, the decision is: “Do not reject H 0 ”.

Step 5: Summarize the results.


There is not enough evidence to support the claim that APs earn more on average than $42,000.

EXAMPLE 9−3

A national magazine claims that the average college student watches less television than the general
public. The national average is 29.4 hours per week, with a standard deviation of 2 hours. A sample of
30 college students has a mean of 27 hours. Is there enough evidence to support the claim at α = 0.01?

SOLUTION

We need to test here μ < 29.4 (claim), which should be stated in the alternative hypothesis.
Step 1: State the hypothesis
H0: μ = 29.4
H1: μ < 29.4 (claim)

Step 2: Find the critical value.


Since α = 0.01 and the test is left-tailed, the area in the left tail is 0.01. Draw a standard normal curve
and find the z-value using the Eton tables. The critical value is z = −2.33. See the diagram below.

[Figure: standard normal curve with the critical region to the left of z = −2.33 and the acceptance region to its right]

Step 3: Compute the test statistic value


Given: n = 30, X̄ = 27, μ = 29.4 and σ = 2.
Therefore,

$$z = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}} = \frac{27 - 29.4}{2 / \sqrt{30}} = -6.57$$

Step 4: Make a decision


Since the test value, z = −6.57 falls in the rejection region, the decision is: “reject H 0 ”.

Step 5: Summarize the results.


There is enough evidence to support the claim that college students watch less television than the general
public.

9.4 Methods of Hypothesis Testing


The method used in the previous section is called the traditional method. There are three methods of
hypothesis testing:

i. Traditional method,
ii. P –value method, and
iii. Confidence Interval method.

9.4.1 The P-value Method


In this section, we will discuss the P-value method of testing hypotheses.
• Many computer statistical packages give a P-value for hypothesis tests.
• The P-value is the smallest significance level at which the null hypothesis is rejected.
• If P-value ≤ α, we reject H0, and if P-value > α, we do not reject H0.

Calculating P-value
The P-value is obtained from the standard normal curve as follows:
• For a left-tailed test, the P-value is the area to the left of the test value.
• For a right-tailed test, the P-value is the area to the right of the test value.
• For a two-tailed test, the P-value is twice the tail area beyond the test value (the area to its left
if the test value is negative, or to its right if it is positive).



Steps in P-value Method
Step 1: State the hypothesis.
Step 2: Compute the test statistic value.
Step 3: Compute the P-value.
Step 4: Make a decision to reject or not reject the null hypothesis.
Step 5: Summarize the results.
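
The P-value rules can be written as a short Python sketch. It is an illustration only: the function names `p_value` and `decide` and the tail labels are assumptions, areas come from the standard library's `NormalDist` rather than from a printed table, and the test value 2.10 in the usage line is simply a sample input.

```python
from statistics import NormalDist


def p_value(z, tail):
    """P-value of a z test value for a 'left', 'right' or 'two' tailed test."""
    left_area = NormalDist().cdf(z)      # area to the left of the test value
    if tail == "left":
        return left_area
    if tail == "right":
        return 1 - left_area
    if tail == "two":
        return 2 * min(left_area, 1 - left_area)  # twice the smaller tail area
    raise ValueError("tail must be 'left', 'right' or 'two'")


def decide(p, alpha):
    """Reject H0 when the P-value does not exceed alpha."""
    return "reject H0" if p <= alpha else "do not reject H0"


p = p_value(2.10, "right")               # sample right-tailed test value
print(round(p, 4), decide(p, 0.05))
```

Because the areas are computed rather than read from a rounded table, results may differ from table-based answers in the fourth decimal place.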

EXAMPLE 9−4

A survey claims that the average cost of a hotel room in Fiji is $69.21. To test the claim, a researcher selects
a sample of 30 hotel rooms and finds that the average cost is $68.43. The standard deviation of the
population is $3.72. At α = 0.05, is there enough evidence to reject the claim? Use the P-value method.

SOLUTION

We need to test whether μ = $69.21 (claim), which should be stated in the null hypothesis.
Step 1: State the hypothesis
H0: μ = $69.21 (claim)
H1: μ ≠ $69.21

Step 2: Compute the test statistic value


Given: n = 30, X̄ = 68.43, μ = 69.21 and σ = 3.72.
Therefore,

$$z = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}} = \frac{68.43 - 69.21}{3.72 / \sqrt{30}} = -1.15$$

Step 3: Compute the P-value.


Use the standard normal table from the Eton tables to find the area on the left of z = −1.15.

[Figure: standard normal curve with the area to the left of z = −1.15 shaded]

Using the table, the area on the left of z = −1.15 is 0.1251. Since this is a two-tailed test, the P-value is
2(0.1251) = 0.2502.

Step 4: Make a decision to reject or not reject the null hypothesis.
Since the P-value is greater than 0.05, the decision is “Do not reject H0”.

Step 5: Summarize the results.


There is not enough evidence to reject the claim that the average cost of hotel room in Fiji is $69.21.

EXAMPLE 9−5

A researcher wishes to test the claim that the average age of lifeguards in Ocean City is greater than 24
years. She selects a sample of 36 guards and finds the mean of the sample to be 24.7 years. The
population standard deviation is assumed to be 2 years. Is there evidence to support the claim at α =
0.05? Use the P-value method.

SOLUTION

We need to test whether μ > 24 (claim), which should be stated in the alternate hypothesis.

Step 1: State the hypothesis


H0: μ = 24
H1: μ > 24 (claim)

Step 2: Compute the test statistic value


Given: n = 36, X̄ = 24.7, μ = 24 and σ = 2.
Therefore,

$$z = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}} = \frac{24.7 - 24}{2 / \sqrt{36}} = 2.10$$

Step 3: Compute the P-value.


Use the standard normal table from the Eton tables to find the area on the right of z = 2.10.

[Figure: standard normal curve with the area to the right of z = 2.10 shaded]

Using the table, the area on the right of z = 2.10 is 0.0179. Since this is a right-tailed test, the P-value is
0.0179.

Step 4: Make a decision to reject or not reject the null hypothesis.
Since the P-value is less than 0.05, the decision is “Reject H0”.


Step 5: Summarize the results.
There is enough evidence to support the claim that the average age of lifeguards in Ocean City is greater
than 24 years.

9.4.2 The Confidence Interval Method


For a two-tailed test, we can also use the confidence interval method as an alternative. If the hypothesized
value of the parameter lies within the interval, do not reject H0; otherwise, reject H0.

Steps in Confidence Interval Method


Step 1: State the hypothesis.
Step 2: Find the confidence interval.
Step 3: Make a decision to reject or not reject the null hypothesis.
Step 4: Summarize the results.

EXAMPLE 9−6

Sugar is packed in 5 lb bags. An inspector suspects the bags may not contain 5 lbs. A sample of 50
bags produces a mean of 4.6 lbs; assume the population standard deviation is 0.7 lbs. Is there enough
evidence to conclude that the bags do not contain 5 lbs as stated, at α = 0.05? Use the confidence interval
method.

SOLUTION

We need to test whether μ ≠ 5 (claim), which should be stated in the alternate hypothesis.


Step 1: State the hypothesis
H0: μ = 5
H1: μ ≠ 5 (claim)

Step 2: Find the confidence interval.


Since α = 0.05, find a 95% confidence interval for μ. We have:

X̄ = 4.6, σ = 0.7, n = 50 and zα/2 = 1.96

Therefore, the 95% confidence interval for μ is

$$\bar{X} - z_{\alpha/2}\left(\frac{\sigma}{\sqrt{n}}\right) < \mu < \bar{X} + z_{\alpha/2}\left(\frac{\sigma}{\sqrt{n}}\right)$$

$$4.6 - (1.96)\left(\frac{0.7}{\sqrt{50}}\right) < \mu < 4.6 + (1.96)\left(\frac{0.7}{\sqrt{50}}\right)$$

$$4.4 < \mu < 4.8$$

Step 3: Make a decision to reject or do not reject null hypothesis.


Since the confidence interval does not contain the hypothesized value μ = 5, we reject the null hypothesis
H0: μ = 5.

Step 4: Summarize the results.


There is enough evidence to conclude that the bags do not contain 5 lbs.
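
The confidence interval method of Example 9−6 can be checked with a short Python sketch. The function name `z_confidence_interval` is an illustrative assumption, and zα/2 is computed with the standard library's `NormalDist` instead of being taken from a table.

```python
from math import sqrt
from statistics import NormalDist


def z_confidence_interval(x_bar, sigma, n, alpha):
    """(1 - alpha) confidence interval for the mean when sigma is known."""
    z = NormalDist().inv_cdf(1 - alpha / 2)   # z_{alpha/2}, about 1.96 for alpha = 0.05
    margin = z * sigma / sqrt(n)
    return x_bar - margin, x_bar + margin


lower, upper = z_confidence_interval(4.6, 0.7, 50, 0.05)
mu0 = 5                                       # hypothesized mean under H0
reject = not (lower < mu0 < upper)            # reject when mu0 lies outside the interval
print(round(lower, 1), round(upper, 1),
      "reject H0" if reject else "do not reject H0")
```

Keeping the interval calculation separate from the decision makes it easy to reuse the same interval for a different hypothesized value.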

9.5 Summary
This chapter introduces the concept of hypothesis testing. The concepts discussed in this chapter are as
follows: statistical hypothesis; null and alternate hypothesis; statistical test; type I and type II error; level
of significance; critical and non-critical region; z-test for mean; methods of hypothesis testing.



EXERCISES

1. Define null hypothesis and alternate hypothesis, and give an example of each.

2. Write the null and alternative hypothesis for each of the following examples. Determine if each is a
case of a two-tailed, a left-tailed, or a right-tailed test.
A. To test if the mean amount of time spent per week watching sports on television by all adult men
is different from 9.5 hours.
B. To test if the mean amount of money spent by all customers at a supermarket is less than $105.
C. To test whether the mean starting salary of college graduates is higher than $39000 per year.
D. To test if the mean waiting time at the drive-through window at a fast food restaurant during rush
hour is at least 10 minutes.
3. The average 1-year-old is 29 inches tall. A random sample of 30 1-year-olds in a large day care
resulted in the following heights. At α = 0.05, can it be concluded that the average height differs from
29 inches? Assume σ = 2.61.

25    32    35    25    30    26.5  26    25.5  29.5  32
30    28.5  30    32    28    31.5  29    29.5  30    34
29    32    27    28    33    28    27    32    29    29.5

4. A researcher claims that adult dogs fed a special diet will have an average weight of 200 lbs. A
sample of 40 dogs has an average weight of 198.2 lbs and a standard deviation of 3.3 lbs.
A. At α = 0.05 can the claim be rejected? Use traditional method.
B. Also, find the 95% confidence interval of the true mean and verify the result in part A above.

5. A Pacific Tapioca manufacturer claims that the packets of tapioca chips they make have a mean
weight of 980g. The standard deviation of the weights is known to be 15g. A random sample of 150
packets has a mean weight of 985g. Does this result support the manufacturer's claim? Use α = 0.1
and the P-value method to test this.

6. The average production of sugarcane in Fiji is 3000 pounds per acre. A new plant food has been
developed and is tested on 60 individual plots of land. The mean yield with the new plant food is 3120
pounds of sugarcane per acre, and the population standard deviation is 578 pounds. At α = 0.05,
can you conclude that the average production has increased?



CHAPTER 10:

HYPOTHESIS TESTING
(PART II)

Chapter 10: Hypothesis Testing (Part II) 149


Overview
In this chapter, we discuss the t-test for mean and the z-test for population proportion. The chapter
concludes with a summary and a set of exercises.

Objectives
After completing this chapter, you should be able to:
1. Test means when population standard deviation is unknown, using the t-test.
2. Test proportions, using a z-test.

10.1 Introduction
In the previous chapter, we discussed the basic concepts of hypothesis testing, the z-test for testing
the population mean and the different methods of testing hypotheses. In this chapter, we will discuss the
t-test for the population mean and the z-test for the population proportion.

10.2 t-test for Mean


We have learnt in the previous chapter that when σ is known, for any sample size (if n < 30, the variable
must be normally distributed), the z-test is used to test the population mean.

However, if σ is unknown, for any sample size (if n < 30, the variable must be normally distributed), the
t-test is used to test the population mean.

Test Statistic
The value of the test statistic is obtained by:

$$t = \frac{\bar{X} - \mu}{s / \sqrt{n}}, \quad \text{when } \sigma \text{ is not known.}$$

The degrees of freedom (d.f.) is n − 1.

EXAMPLE 10−1

A job placement director claims that the average starting salary for nurses is $24,000. A sample of 10
nurses has a mean of $23,450 and a standard deviation of $400. Is there enough evidence to reject the
director’s claim at α = 0.05? Assume the variable is normally distributed.

SOLUTION

We need to test here μ = 24000 (claim), which should be stated in the null hypothesis.

Step 1: State the hypothesis


H0: μ = 24000 (claim)
H1: μ ≠ 24000



Step 2: Find the critical value.
Since α = 0.05 and the test is two-tailed, the areas in the left tail and the right tail are each 0.05/2 = 0.025.
Using the t-distribution table from the Eton tables with d.f. = 9 and α = 0.05 (or 2p = 0.05), the critical
values are t = ±2.262. See the diagram below.

[Figure: t curve with critical regions to the left of t = −2.262 and to the right of t = 2.262, and the acceptance region between them]

Step 3: Compute the test statistic value


Given: n = 10, X̄ = 23450, μ = 24000 and s = 400.
Therefore,

$$t = \frac{\bar{X} - \mu}{s / \sqrt{n}} = \frac{23450 - 24000}{400 / \sqrt{10}} = -4.35$$

Step 4: Make a decision


Since the test value, t= -4.35 falls in the rejection region, the decision is: “reject H 0 ”.

Step 5: Summarize the results.


There is enough evidence to reject the claim that the starting salary of nurses is $24,000.

EXAMPLE 10−2

An MP claims that the average number of acres in his province’s State Parks is less than 2000 acres. A
random sample of five parks is selected and the number of acres is shown below. Assume the variable
is normally distributed.

959, 1187, 493, 6249, 541

At α = 0.01, is there enough evidence to support the claim?

SOLUTION

We need to test here μ < 2000 (claim), which should be stated in the alternate hypothesis.

Step 1: State the hypothesis


H0: μ = 2000
H1: μ < 2000 (claim)



Step 2: Find the critical value.
Since α = 0.01 and the test is left-tailed, the area in the left tail is 0.01. Using the t-distribution table
from the Eton tables with d.f. = 4 and α = 0.01 (p = 0.01), the critical value is t = −3.747. See the diagram
below.

[Figure: t curve with the critical region to the left of t = −3.747 and the acceptance region to its right]

Step 3: Compute the test statistics value


For the given sample values, X̄ = 1885.8 and s = 2456.3.
Therefore,

$$t = \frac{\bar{X} - \mu}{s / \sqrt{n}} = \frac{1885.8 - 2000}{2456.3 / \sqrt{5}} = -0.104$$

Step 4: Make a decision


Since the test value, t= -0.104 falls in the acceptance region, the decision is: “ do not reject H 0 ”.

Step 5: Summarize the results.


We may conclude that there is not enough evidence to support the claim.
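
The sample mean, sample standard deviation and t value of Example 10−2 can be reproduced from the raw data with Python's standard library. This is a minimal sketch: the variable names are illustrative, and the critical value is not computed but typed in from the t-table lookup in Step 2, because the standard library has no t-distribution.

```python
from math import sqrt
from statistics import mean, stdev

acres = [959, 1187, 493, 6249, 541]   # sample data from the example
mu0 = 2000                            # hypothesized mean under H0
t_critical = -3.747                   # table value: d.f. = 4, alpha = 0.01, left-tailed

n = len(acres)
x_bar = mean(acres)                   # sample mean
s = stdev(acres)                      # sample standard deviation (divides by n - 1)
t = (x_bar - mu0) / (s / sqrt(n))     # test statistic value

print(round(x_bar, 1), round(s, 1), round(t, 3))
print("reject H0" if t < t_critical else "do not reject H0")
```

Note that `stdev` is the sample standard deviation; `pstdev`, which divides by n, would not be appropriate here.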

10.3 z-test for Proportion


Often we want to test a hypothesis about the population proportion, p, of a characteristic. In this section, we
will discuss the hypothesis testing of p when the sample size is large. The procedures are very similar
to the procedures for testing hypotheses about μ discussed earlier.

Test statistic
If a large sample of size n is drawn for testing a population proportion, the value of the test statistic (z-test)
is given by:

$$z = \frac{\hat{p} - p}{\sqrt{pq/n}}$$

where
p̂ = X/n (sample proportion)
p = population proportion
q = 1 − p
n = sample size



EXAMPLE 10−3

An educator estimates that the dropout rate for seniors at high schools in a particular city is 15%. Last year,
38 seniors from a random sample of 200 seniors withdrew. At α = 0.05, is there enough evidence to
reject the educator’s claim? Use the traditional method.

SOLUTION

We need to test here p = 0.15 (claim), which should be stated in the null hypothesis.
Step 1: State the hypothesis
H0: p = 0.15 (claim)
H1: p ≠ 0.15
Step 2: Find the critical value.
Since α = 0.05 and the test is two-tailed, the areas in the left tail and the right tail are each 0.025. Draw a
standard normal curve and find the z-values using the tables from the Eton tables. The critical values are
z = ±1.96. See the diagram below.

[Figure: standard normal curve with critical regions to the left of z = −1.96 and to the right of z = 1.96, and the acceptance region between them]

Step 3: Compute the test statistic value


Here, p = 0.15, q = 0.85, n = 200 and p̂ = 38/200 = 0.19.
Therefore,

$$z = \frac{\hat{p} - p}{\sqrt{pq/n}} = \frac{0.19 - 0.15}{\sqrt{(0.15)(0.85)/200}} = 1.58$$

Step 4: Make a decision


Since the test statistic value z = 1.58 falls in the acceptance region, we do not reject H0. Hence, we may
conclude that there is not enough evidence to reject the educator’s claim.
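
The z value for a proportion can be sketched in Python as follows. The function name `proportion_z` and its argument names are illustrative assumptions; the usage lines apply it to the data of Example 10−3 and compare the result with the two-tailed critical value 1.96 for α = 0.05.

```python
from math import sqrt


def proportion_z(successes, n, p0):
    """z test value for a population proportion (large sample)."""
    p_hat = successes / n                 # sample proportion
    q0 = 1 - p0
    return (p_hat - p0) / sqrt(p0 * q0 / n)


z = proportion_z(38, 200, 0.15)
print(round(z, 2))
print("reject H0" if abs(z) > 1.96 else "do not reject H0")  # two-tailed, alpha = 0.05
```

The standard error uses the hypothesized proportion p0, not the sample proportion, because the test statistic is computed assuming the null hypothesis is true.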



EXAMPLE 10−4

A recent study found that, at most, 32% of people who have been in a plane crash have died. In a sample
of 100 people who were in a plane crash, 38 died. Should the study’s claim be rejected? Use α = 0.05.
Use the traditional method.

SOLUTION

We need to test here p ≤ 0.32 or p = 0.32 (claim), which should be stated in the null hypothesis.
Step 1: State the hypothesis
H0: p = 0.32 (claim)
H1: p > 0.32

Step 2: Find the critical value.


Since α = 0.05 and the test is right-tailed, the area in the right tail is 0.05. Using the standard normal
table from the Eton tables, the critical value is z = 1.65. See the diagram below.

[Figure: standard normal curve with the critical region to the right of z = 1.65 and the acceptance region to its left]

Step 3: Compute the test statistic value


Given: p̂ = 38/100 = 0.38, p = 0.32 and q = 1 − p = 0.68. Therefore,

$$z = \frac{\hat{p} - p}{\sqrt{pq/n}} = \frac{0.38 - 0.32}{\sqrt{(0.32)(0.68)/100}} = 1.29$$

Step 4: Make a decision


Since the test statistic value z = 1.29 falls in the acceptance region, we do not reject H0. Hence, we may
conclude that there is not enough evidence to reject the claim.



EXAMPLE 10−5

At a large university, a study found that no more than 25% of the students who commute travel more than
14 miles to campus. In a sample of 100 students, 30 drove more than 14 miles. At α = 0.10, test the
finding. Use the P-value method.

SOLUTION

Step 1: State the hypothesis


The study found that no more than 25% of the students who commute travel more than 14 miles to
campus, that is, p ≤ 0.25. Therefore, the hypotheses to be tested are:
H0: p = 0.25
H1: p > 0.25
Step 2: Compute the test statistic value
Given that p̂ = 30/100 = 0.3, p = 0.25 and q = 1 − p = 0.75. Therefore,

$$z = \frac{\hat{p} - p}{\sqrt{pq/n}} = \frac{0.3 - 0.25}{\sqrt{(0.25)(0.75)/100}} = 1.15$$

Step 3: Compute the P-value.
Using the standard normal table from the Eton tables, we find the area on the right of z = 1.15.

[Figure: standard normal curve with the area to the right of z = 1.15 shaded]

Using the table, the area on the right of z = 1.15 is 0.1251. Since this is a right-tailed test, the P-value is
0.1251.

Step 4: Make a decision to reject or not reject the null hypothesis.


Since the P-value is greater than 0.1, the decision is “Do not reject H0”. Hence, we may conclude that
there is not enough evidence to reject the findings.
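
The P-value calculation of Example 10−5 can be checked numerically. The sketch below is an illustration with assumed variable names; it computes the right-tail area with the standard library's `NormalDist` instead of a table. Because it uses the unrounded z value (about 1.155 rather than 1.15), the P-value it prints is slightly smaller than the table-based 0.1251, but the decision is the same.

```python
from math import sqrt
from statistics import NormalDist

p0, n, successes, alpha = 0.25, 100, 30, 0.10

p_hat = successes / n                        # sample proportion
z = (p_hat - p0) / sqrt(p0 * (1 - p0) / n)   # test statistic value
p_val = 1 - NormalDist().cdf(z)              # right-tailed: area to the right of z

print(round(z, 2), round(p_val, 4))
print("reject H0" if p_val <= alpha else "do not reject H0")
```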

10.4 Summary
This chapter discusses the t-test for mean and the z-test for population proportion.



EXERCISES

1. An attorney claims that more than 25% of all lawyers advertise. A sample of 200 lawyers in a certain
city showed that 63 had used some form of advertising. At α = 0.05, is there enough evidence to
support the attorney’s claim? Use the P-value method.

2. A recent survey found that 68% of the population own their homes. In a random sample of 150
heads of households, 92 responded that they owned their homes. At the α = 0.01 level of significance,
does that suggest a difference from the national proportion? Use the traditional method.

3. The average family size was reported as 3.18. A random sample of families in a particular school
district resulted in the following family sizes:

5 4 5 4 4 3 6 4 3 3

5 6 3 3 2 7 4 5 2 2

3 5 2 2

At α = 0.05, does the average family size differ from the national average? To test the claim:
A. Use a confidence interval method.
B. Use a traditional method.

4. A researcher in Vanuatu claims that a factory worker in Vanuatu earns an average of $700 per week.
A sample of 400 factory workers in Vanuatu showed that they earn an average of $685 per week
with a standard deviation of $125. Using α = 0.01, can you conclude that there is evidence to support
the researcher’s claim? Use the confidence interval method.

5. A food company is planning to market a new type of frozen yogurt. However, before marketing this
yogurt the company wants to find what percentage of people like it. The company’s management
has decided it will market this yogurt only if at least 35% of people like it. The company’s research
department selected a random sample of 400 persons and asked them to taste this yogurt. Of these
400 persons, 112 said they liked it. Testing at the 2.5% significance level, can you conclude that the
company should market this yogurt? Use the traditional method.



CHAPTER 11:

TESTING THE EQUALITY OF


TWO POPULATION MEANS

Chapter 11: Testing the Equality of Two Population Means 157


Overview
This chapter explains hypothesis testing of the equality of two population means. The concepts
discussed in this chapter are the z-test and the t-test for testing two population means. The chapter
concludes with a summary and a set of exercises.

Objectives
After completing this chapter, you should be able to:
1. Test the difference between two sample means, using the z-test.
2. Test the difference between two means for independent samples, using the t-test.

11.1 Introduction
The basic concepts of hypothesis testing were explained in Chapter 9. With the z and t tests, a sample
mean or proportion can be compared to a specific population mean or proportion.

There are, however, many instances when the researchers wish to compare two sample means, using
experiments and control groups. For example, the average lifetimes of two different brands of bus tires
might be compared to see whether there is any difference in the tread wear. Two different brands of
fertilizer might be tested to see whether one is better than the other for growing plants.

In comparing the means, the same basic steps for hypothesis testing are used, and the z and t-tests are
also used. When comparing two means by using the t-test, the researcher must decide whether the samples
are independent or dependent.

11.2 z-test for two Means


Researchers often want to compare two population means using two samples drawn from the
populations. For example, a researcher may wish to know whether there is a difference in the average age
of students enrolled at USP On-Campus (OC) and those enrolled through Distance and Flexible Learning
(DFL). Then, the hypotheses to be tested are:
H0: μ1 = μ2        H0: μ1 − μ2 = 0
             or
H1: μ1 ≠ μ2        H1: μ1 − μ2 ≠ 0
where
μ1 = mean age of students enrolled OC
μ2 = mean age of students enrolled through DFL

To test the difference between two means, we have to know whether the two samples drawn from the
populations are dependent or independent, whether they are large or small, and whether the population
standard deviations are known or unknown.

11.2.1 Dependent and Independent Samples


Two samples drawn from two populations are independent if the selection of one sample from one
population does not affect the selection of the second sample from the second population. Otherwise,
the samples are dependent. Suppose we would like to compare the mean salaries of male and female
staff at USP. Then, the samples must be drawn separately from two distinct groups, all males at USP
and all females at USP. These samples are known as independent samples. If we wish to study
whether a new drug is effective in controlling a disease, two samples are taken from the same units,
before and after the use of the drug. These samples are known as dependent samples.

11.2.2 Hypothesis
If we wish to decide whether the means of the populations from which two independent samples were
selected are really different or the same, then the null hypothesis is H0: μ1 = μ2 (i.e. the means are not
different) and the alternative hypothesis could be any one of the following:

(i) H1: μ1 ≠ μ2 (two-tailed test)
or, (ii) H1: μ1 < μ2 (left-tailed test)
or, (iii) H1: μ1 > μ2 (right-tailed test)

Assumptions for the z-test


• Both samples are random samples and independent of each other.
• The standard deviations of both populations must be known, and if the sample sizes are less than
30, the populations must be normally distributed.

Test Statistic
The value of the test statistic if σ1 and σ2 are known:

$$z = \frac{(\bar{X}_1 - \bar{X}_2) - (\mu_1 - \mu_2)}{\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}}$$

If σ1 and σ2 are known but the sample sizes are small (populations normally distributed), the test
statistic is the same.

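Confidence Interval Formula


The confidence interval for the difference between two means in the above cases can be found by:

$$(\bar{X}_1 - \bar{X}_2) - z_{\alpha/2}\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}} < \mu_1 - \mu_2 < (\bar{X}_1 - \bar{X}_2) + z_{\alpha/2}\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}$$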



EXAMPLE 11−1

A survey found that the average hotel room rate in FJ is $88.42 and the average room rate in NZ is
$80.61. Assume that the data were obtained from two samples of 50 hotels each and that the population
standard deviations were $5.62 and $4.83 respectively. At α = 0.05, can it be concluded that there is no
significant difference in the rates?

SOLUTION

We need to test here μ1 = μ2 (claim), which should be stated in the null hypothesis.

Step 1: State the hypothesis


H0: μ1 = μ2 (claim)
H1: μ1 ≠ μ2

Step 2: Find the critical value.


Since α = 0.05 and the test is two-tailed, using the standard normal tables we get the critical values
z = ±1.96. See the diagram below.

[Figure: standard normal curve with critical regions to the left of z = −1.96 and to the right of z = 1.96, and the acceptance region between them]

Step 3: Compute the test statistic value


Given that X̄1 = 88.42, X̄2 = 80.61, σ1 = 5.62, σ2 = 4.83, n1 = 50 and n2 = 50. Since the sample
sizes are large, the test statistic value is:

$$z = \frac{(\bar{X}_1 - \bar{X}_2) - (\mu_1 - \mu_2)}{\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}} = \frac{(88.42 - 80.61) - 0}{\sqrt{\dfrac{5.62^2}{50} + \dfrac{4.83^2}{50}}} = 7.45$$

Step 4: Make a decision.


Since the test value, z = 7.45, falls in the rejection region, the decision is: “Reject H0”. Hence, it can be
concluded that there is a significant difference in the rates.
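
The two-sample z value can be computed with a short Python sketch. The function name `two_sample_z` and its parameter names (including `diff0` for the hypothesized difference, zero under H0: μ1 = μ2) are illustrative assumptions; the call uses the figures of Example 11−1.

```python
from math import sqrt


def two_sample_z(x1_bar, x2_bar, sigma1, sigma2, n1, n2, diff0=0.0):
    """z test value for the difference between two means, sigmas known."""
    standard_error = sqrt(sigma1 ** 2 / n1 + sigma2 ** 2 / n2)
    return ((x1_bar - x2_bar) - diff0) / standard_error


z = two_sample_z(88.42, 80.61, 5.62, 4.83, 50, 50)
print(round(z, 2))
print("reject H0" if abs(z) > 1.96 else "do not reject H0")  # two-tailed, alpha = 0.05
```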



EXAMPLE 11−2

Solve Example 11-1 using:


A. P-value method.
B. Confidence Interval method.

SOLUTION

A. The test value from Example 11-1 is z = 7.45, so the P-value (twice the area to the right of z = 7.45) is
approximately equal to 0. Since the P-value is less than 0.05, we reject the null hypothesis.
B. Since α = 0.05, we have to construct a 95% confidence interval for μ1 − μ2. Substituting into the formula,
one gets:

$$(88.42 - 80.61) - 1.96\sqrt{\frac{5.62^2}{50} + \frac{4.83^2}{50}} < \mu_1 - \mu_2 < (88.42 - 80.61) + 1.96\sqrt{\frac{5.62^2}{50} + \frac{4.83^2}{50}}$$

$$5.76 < \mu_1 - \mu_2 < 9.86$$

Since the confidence interval does not contain zero, one would reject the null hypothesis.
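
Both methods can be reproduced in a few lines. This is an illustrative sketch only, assuming the summary values of Example 11−1; the variable names are our own, and the normal areas come from Python's statistics.NormalDist rather than the printed tables.

```python
from math import sqrt
from statistics import NormalDist

x1_bar, x2_bar = 88.42, 80.61
sigma1, sigma2 = 5.62, 4.83
n1 = n2 = 50
alpha = 0.05

standard_error = sqrt(sigma1**2 / n1 + sigma2**2 / n2)
z = (x1_bar - x2_bar) / standard_error

# A. Two-tailed P-value: twice the area beyond |z| under the standard normal curve.
p_value = 2 * (1 - NormalDist().cdf(abs(z)))

# B. 95% confidence interval for mu1 - mu2.
margin = NormalDist().inv_cdf(1 - alpha / 2) * standard_error
lower = (x1_bar - x2_bar) - margin
upper = (x1_bar - x2_bar) + margin

print(p_value < alpha, round(lower, 2), round(upper, 2))
```

The null hypothesis is rejected by method A when p_value is below alpha, and by method B when zero lies outside the interval from lower to upper.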

EXAMPLE 11−3

The data shown are the rental fees (in dollars) for two random samples of apartments in a large city. At
α = 0.10, can it be concluded that the average rental fee for apartments in the east is greater than
the average rental fee in the west? Assume σ₁ = 119 and σ₂ = 103.

East                          West
495  390  540  445  420       525  400  310  375  750
410  550  499  500  550       390  795  554  450  370
389  350  450  530  350       385  395  425  500  550
375  690  325  350  799       380  400  450  365  425
475  295  350  485  625       375  360  425  400  475
275  450  440  425  675       400  475  430  410  450
625  390  485  550  650       425  450  620  500  400
685  385  450  550  425       295  350  300  360  400

SOLUTION

We need to test μ₁ > μ₂ (claim), which should be stated in the alternative hypothesis.

Step 1: State the hypothesis

H₀: μ₁ ≤ μ₂
H₁: μ₁ > μ₂ (claim)

Step 2: Find the critical value.

Since α = 0.10 and the test is right-tailed, using the standard normal tables we get the critical value
z = 1.28. See the diagram below.

[Figure: standard normal curve with the critical region in the right tail, beyond z = 1.28, and the acceptance region to its left.]

Step 3: Compute the test statistics value


Given that X̄₁ = 477.43, X̄₂ = 437.35, σ₁ = 119, σ₂ = 103, n₁ = 40, and n₂ = 40. Since the
sample sizes are large, the test statistic value is:

$$z = \frac{(\bar{X}_1 - \bar{X}_2) - (\mu_1 - \mu_2)}{\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}} = \frac{(477.43 - 437.35) - 0}{\sqrt{\frac{119^2}{40} + \frac{103^2}{40}}} = 1.61.$$

Step 4: Make a decision


Since the test value z = 1.61 falls in the rejection region, the decision is: “Reject H₀”. It can be concluded
that the average rental fee for apartments in the east is greater than the average rental fee for
apartments in the west.
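
When raw data are given, the sample means must be computed before the test statistic. The sketch below is a hedged illustration, assuming that the first five columns of the data table are the East sample and the last five columns are the West sample; the variable names are our own.

```python
from math import sqrt
from statistics import NormalDist, mean

east = [
    495, 390, 540, 445, 420, 410, 550, 499, 500, 550,
    389, 350, 450, 530, 350, 375, 690, 325, 350, 799,
    475, 295, 350, 485, 625, 275, 450, 440, 425, 675,
    625, 390, 485, 550, 650, 685, 385, 450, 550, 425,
]
west = [
    525, 400, 310, 375, 750, 390, 795, 554, 450, 370,
    385, 395, 425, 500, 550, 380, 400, 450, 365, 425,
    375, 360, 425, 400, 475, 400, 475, 430, 410, 450,
    425, 450, 620, 500, 400, 295, 350, 300, 360, 400,
]
sigma_east, sigma_west = 119, 103   # population SDs given in the problem
alpha = 0.10

mean_east, mean_west = mean(east), mean(west)
standard_error = sqrt(sigma_east**2 / len(east) + sigma_west**2 / len(west))
z = (mean_east - mean_west) / standard_error

# Right-tailed test: the whole of alpha sits in the upper tail.
critical = NormalDist().inv_cdf(1 - alpha)
reject_h0 = z > critical

print(mean_east, mean_west, round(z, 2), round(critical, 2), reject_h0)
```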

11.3 t-test for Two Means (Independent Samples)


We will use the t-test for testing the hypotheses if the following conditions are satisfied:
• The population standard deviations are unknown and unequal.
• If the samples are small (i.e. n₁ < 30 and n₂ < 30), then the populations from which the
samples are drawn are normally distributed.
• The samples are independent.

Test Statistic
The value of the test statistic is:

$$t = \frac{(\bar{X}_1 - \bar{X}_2) - (\mu_1 - \mu_2)}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}.$$

The degrees of freedom (d.f.) are equal to the smaller of n₁ − 1 and n₂ − 1.



Confidence Interval Formula
Confidence interval for the difference between two means in this case is:

$$(\bar{X}_1 - \bar{X}_2) - t_{\alpha/2}\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}} < \mu_1 - \mu_2 < (\bar{X}_1 - \bar{X}_2) + t_{\alpha/2}\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}.$$

The degrees of freedom (d.f.) are equal to the smaller of n₁ − 1 and n₂ − 1.

EXAMPLE 11−4

The average size of a farm in Ba is 191 acres. The average size of a farm in Nadi is 199 acres. Assume
the data were obtained from two samples with standard deviations of 32 and 12 acres, respectively, and
sample sizes 8 and 10, respectively. Can it be concluded at α = 0.05 that the average size of the farm
in the two districts in Fiji is different? Assume the populations are normally distributed.

SOLUTION

We need to test μ₁ ≠ μ₂ (claim), which should be stated in the alternative hypothesis.

Step 1: State the hypothesis

H₀: μ₁ = μ₂
H₁: μ₁ ≠ μ₂ (claim)

Step 2: Find the critical value.

Since α = 0.05 and the test is two-tailed, using the t-distribution table from the Eton tables with d.f. =
8 − 1 = 7 and 2p = 0.05, we get the critical values t = ±2.365. See the diagram below.

[Figure: t-distribution curve with critical regions in both tails, beyond t = −2.365 and t = 2.365, and the acceptance region between them.]

Step 3: Compute the test statistics value


Given that X̄₁ = 191, X̄₂ = 199, s₁ = 32, s₂ = 12, n₁ = 8, and n₂ = 10. Since the population
standard deviations are unknown, the test value is

$$t = \frac{(\bar{X}_1 - \bar{X}_2) - (\mu_1 - \mu_2)}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}} = \frac{(191 - 199) - 0}{\sqrt{\frac{32^2}{8} + \frac{12^2}{10}}} = -0.67.$$



Step 4: Make a decision
Since the test value t = −0.67 falls in the acceptance region, the decision is: “do not reject H₀”. There is not
enough evidence to support the claim that the average size of the farm in the two districts in Fiji is different.

EXAMPLE 11−5

The mean age of a sample of 25 people who were playing soccer is 48.7 years, and the standard deviation
is 6.8 years. The mean age of a sample of 35 people who were playing rugby is 55.3 years, with a standard
deviation of 3.2 years. Can it be concluded at α = 0.05 that the mean age of those playing soccer is less
than that of those playing rugby? Assume the populations are normally distributed.

SOLUTION

We need to test μ₁ < μ₂ (claim), which should be stated in the alternative hypothesis.

Step 1: State the hypothesis

H₀: μ₁ ≥ μ₂
H₁: μ₁ < μ₂ (claim)

Step 2: Find the critical value.

Since α = 0.05 and the test is left-tailed, using the t-distribution table from the Eton tables with d.f. =
25 − 1 = 24 and p = 0.05, we get the critical value t = −1.711. See the diagram below.

[Figure: t-distribution curve with the critical region in the left tail, beyond t = −1.711, and the acceptance region to its right.]

Step 3: Compute the test statistics value

$$t = \frac{(\bar{X}_1 - \bar{X}_2) - (\mu_1 - \mu_2)}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}} = \frac{(48.7 - 55.3) - 0}{\sqrt{\frac{6.8^2}{25} + \frac{3.2^2}{35}}} = -4.509.$$

Step 4: Make a decision


Since the test value t = −4.509 falls in the critical region, the decision is: “Reject H₀”. There is enough
evidence to support the claim that the mean age of those playing soccer is less than that of those playing rugby.
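
The two t-test examples above can be verified with a short Python sketch. It is an illustration under stated assumptions: the helper name two_sample_t is hypothetical, s₁ = 32 is taken from the statement of Example 11−4, and the critical values 2.365 and −1.711 are copied from the t-table because the Python standard library has no t-distribution.

```python
from math import sqrt


def two_sample_t(mean1, mean2, s1, s2, n1, n2):
    """t statistic and d.f. when the population SDs are unknown and unequal."""
    standard_error = sqrt(s1**2 / n1 + s2**2 / n2)
    t = (mean1 - mean2) / standard_error
    df = min(n1 - 1, n2 - 1)   # smaller of n1 - 1 and n2 - 1
    return t, df


# Example 11-4: two-tailed test, table value for d.f. = 7 and 2p = 0.05.
t_farm, df_farm = two_sample_t(191, 199, 32, 12, 8, 10)
reject_farm = abs(t_farm) > 2.365

# Example 11-5: left-tailed test, table value for d.f. = 24 and p = 0.05.
t_age, df_age = two_sample_t(48.7, 55.3, 6.8, 3.2, 25, 35)
reject_age = t_age < -1.711

print(round(t_farm, 2), df_farm, reject_farm)
print(round(t_age, 3), df_age, reject_age)
```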



11.4 Summary
This chapter explains the hypothesis testing of the equality of two population means. The concepts
discussed in this chapter are z-test and the t-test for testing two population means.

EXERCISES

1. A researcher claims that the average yearly earnings of male college graduates (with at least a
bachelor’s degree) is different from the average yearly earnings of female college graduates with the
same qualifications. Based on the results below, can it be concluded that there is difference in mean
earnings between male and female college graduates? Use the 0.01 level of significance.

Male Female
Sample mean $59,235 $52,487
Population standard deviation $8,945 $10,125
Sample size 40 35

2. A researcher wishes to see if there is a difference in the cholesterol levels of two groups of men. A
random sample of 30 men between the ages of 25 and 40 is selected and tested; the average
cholesterol level was 223 with standard deviation of 6.1. A second sample of 25 men between ages
of 41 and 56 is selected and tested; the average cholesterol level for this group was 229 with standard
deviation of 5.8. Assume the populations are normally distributed and the population standard
deviations are unequal. At α = 0.01, is there a difference in the cholesterol levels between the two
groups? Use the traditional method.

3. The mean height of 20 male athletes in Fiji was 68.2 inches, while 20 male non-athletes in Fiji had
a mean height of 67.5 inches. The population standard deviations were 2.5 inches and 2.8
inches, respectively. Assume the populations are normally distributed. Test the hypothesis that
athletes are taller than non-athletes at the 5% level of significance, using:
A. P-value method.
B. Verify the solution in Part A using confidence interval method.

4. A sample of 35 chemists from Lautoka city shows an average salary of $39,420 with a standard
deviation of $1659, while a sample of 40 chemists from Suva city has an average salary of $30,215
with a standard deviation of $4116. Is there a significant difference between the chemists’ salaries in the
two cities at α = 0.02?

5. A researcher claims that the mean of the salaries of primary school teachers is greater than the mean
of the salaries of secondary school teachers in Fiji. The mean of the salaries of a sample of 26 primary
school teachers is $48,256, and the sample standard deviation is $3,912.40. The mean of the salaries
of a sample of 24 secondary school teachers is $45,633, and the sample standard deviation is
$5,533. Assume the populations are normally distributed and the population standard deviations are
unequal. At α = 0.05, can it be concluded that the mean of the salaries of the primary school teachers
is greater than the mean of the salaries of the secondary school teachers?



CHAPTER 12:

CORRELATION AND
REGRESSION

Chapter 12: Correlation and Regression 166


Overview
This chapter explains the concepts of correlation and regression. The concepts discussed in this chapter
are scatter plots, correlation coefficient, testing the significance of correlation, regression line and
coefficient of determination. The chapter also discusses the concept of multiple linear regression. The
chapter concludes with a summary and a set of exercises.

Objectives
After completing this chapter, you should be able to:
1. Draw a scatter plot.
2. Compute the correlation coefficient.
3. Test the correlation coefficient.
4. Compute the equation of the regression line.
5. Use the concept of multiple regression.

12.1 Introduction
Another area of inferential statistics involves determining whether a relationship between two or more
quantitative (numerical) variables exists. For example, an educator may want to know whether there is
any relationship between the number of absences and the student’s final grade for a student in her class.
A scientist would be interested in knowing whether there is any relationship between age and blood
pressure of a person.

This chapter considers the relationship between two variables, which can be studied by correlation
and regression analysis. Correlation measures how strongly two variables are related. Regression
analysis, on the other hand, fits a model using these two variables, which helps to predict the
value of one variable when the value of the other variable is known. For example, correlation can be used by
an economist to find out how strongly the income and expenditure of a household are related, and regression
can fit a model to predict the expenditure of a household for a given income.

There are two types of regression: simple and multiple. In simple regression, there are two variables; an
independent variable, also called explanatory variable or a predictor variable, and a dependent variable,
also called a response variable. In simple regression, the independent variable is used to predict the
dependent variable. In multiple regression, two or more independent variables exist with only one
dependent variable.

12.2 Correlation
If a change in one variable is accompanied by a change in the other variable, then the variables are said to be
correlated, and the association between the two variables is known as correlation. In simple
regression studies, the researcher collects data on two quantitative variables to see whether a
relationship exists between the variables. For example, if a researcher wishes to see whether there is
a relationship between the age and blood pressure of a person, he must select a random sample of
people and record their age and blood pressure. A table can be made as shown below.



Subject Age, x Pressure, y
A 43 128
B 48 120
C 56 135
D 61 143
E 67 141
F 70 152

The two variables for this study are called independent and dependent variable. The independent
variable is the one that can be controlled or manipulated. In this case, the age of a person is the
independent variable and is denoted as x . The dependent variable is the one that cannot be controlled
or manipulated and in this case the blood pressure of a person is the dependent variable and is denoted
as y.

The relationship between the variables may be positive, negative, or absent.

Positive correlation
If the variables change in the same direction, i.e. an increase (or decrease) in one variable is
accompanied by an increase (or decrease) in the other variable, then the variables are positively correlated. For
example, (i) height and weight of persons and (ii) income and expenditure of households are positively
correlated.

Negative correlation
If the variables change in opposite directions, i.e. an increase (or decrease) in one variable is
accompanied by a decrease (or increase) in the other, then the variables are negatively correlated. For example,
(i) price and demand of commodities and (ii) number of absences and final exam mark are negatively
correlated.

No correlation
If two variables are independent of each other and not related in any fashion, then there cannot be any
correlation between the variables. For example, the correlation between each of the following pairs should be zero:
• height and income of individuals,
• marriage rate and agricultural production rate in a country, and
• shoe size and intelligence of a group of individuals.

Methods of studying correlation


1. Scatter Plots
2. Coefficient of Correlation

12.2.1 Scatter Plots


If the values of two variables are plotted along the x-axis and y-axis respectively, then the diagram of
dots so obtained is known as a scatter diagram (or scatter plot). It is the simplest method to study the
correlation between two variables.



If the points seem to form a pattern with an upward slope, then the variables are said to be positively
correlated.

[Figure: two scatter plots of y against x in which the points follow an upward-sloping pattern.]

If the points seem to form a pattern with a downward slope, then the variables are said to be
negatively correlated.

[Figure: two scatter plots of y against x in which the points follow a downward-sloping pattern.]

If the points do not form any pattern with downward or upward slope, then the variables are said to be
uncorrelated.



EXAMPLE 12−1

Construct a scatter plot for the data obtained in a study of age and systolic blood pressure of six
randomly selected subjects.

Subject Age, x Pressure, y


A 43 128
B 48 120
C 56 135
D 61 143
E 67 141
F 70 152

SOLUTION

The scatter diagram for the given data is shown below:

[Figure: scatter plot of pressure (vertical axis, 100 to 160) against age (horizontal axis, 30 to 80).]

The above scatter diagram indicates that there is a positive correlation between the age and the blood
pressure.

EXAMPLE 12−2

Marks of eight students who sat an examination in English and Mathematics are given by

Maths ( x) 35 35 40 45 50 50 60 69
English ( y ) 50 40 30 65 35 50 50 40

Construct a scatter plot.



SOLUTION

The scatter plot for the given data is shown below:

The above scatter diagram indicates that there is no correlation between the variables.

12.2.2 The Correlation Coefficient


The correlation coefficient computed from sample data measures the strength and direction of a linear
relationship between two variables.
• The symbol for the sample correlation coefficient is r.
• The symbol for the population correlation coefficient is ρ.

The formula to compute the sample correlation coefficient (r) is:

$$r = \frac{n\left(\sum xy\right) - \left(\sum x\right)\left(\sum y\right)}{\sqrt{\left[n\left(\sum x^2\right) - \left(\sum x\right)^2\right]\left[n\left(\sum y^2\right) - \left(\sum y\right)^2\right]}},$$

where n is the number of data pairs.

Note:
• The value of r is always between −1 and +1, that is, −1 ≤ r ≤ 1.
• If r is close to +1, there is a strong positive linear relationship.
• If r is close to −1, there is a strong negative linear relationship.
• If r is close to 0, there is little or no linear relationship. See the diagram below.



EXAMPLE 12−3

Calculate the correlation coefficient for the data in Example 12−1.

SOLUTION

x      y      xy       x²      y²
43     128    5504     1849    16384
48     120    5760     2304    14400
56     135    7560     3136    18225
61     143    8723     3721    20449
67     141    9447     4489    19881
70     152    10640    4900    23104
Σx = 345   Σy = 819   Σxy = 47634   Σx² = 20399   Σy² = 112443

With n = 6,

$$r = \frac{6(47634) - (345)(819)}{\sqrt{\left[6(20399) - (345)^2\right]\left[6(112443) - (819)^2\right]}} = 0.897.$$

This shows there is a strong positive linear correlation between the two variables, age and blood
pressure.

EXAMPLE 12−4

Calculate the correlation coefficient for the following data.

No. of absences, x 6 2 15 9 12 5 8
Final exam mark, y 82 86 43 74 58 90 78

SOLUTION

x      y      xy      x²     y²
6      82     492     36     6724
2      86     172     4      7396
15     43     645     225    1849
9      74     666     81     5476
12     58     696     144    3364
5      90     450     25     8100
8      78     624     64     6084
Σx = 57   Σy = 511   Σxy = 3745   Σx² = 579   Σy² = 38993

With n = 7,

$$r = \frac{7(3745) - (57)(511)}{\sqrt{\left[7(579) - (57)^2\right]\left[7(38993) - (511)^2\right]}} = -0.944.$$

This shows there is a strong negative linear correlation between the two variables, number of absences and
final exam mark of a student.
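
The computational formula for r translates directly into code. The following Python sketch is illustrative only; the function name correlation is our own, and the data are those of Examples 12−3 and 12−4.

```python
from math import sqrt


def correlation(xs, ys):
    """Sample correlation coefficient r from the sums used in the worked tables."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt((n * sum_x2 - sum_x**2) * (n * sum_y2 - sum_y**2))
    return numerator / denominator


# Example 12-3: age and systolic blood pressure.
ages = [43, 48, 56, 61, 67, 70]
pressures = [128, 120, 135, 143, 141, 152]
r_age = correlation(ages, pressures)

# Example 12-4: number of absences and final exam mark.
absences = [6, 2, 15, 9, 12, 5, 8]
marks = [82, 86, 43, 74, 58, 90, 78]
r_absences = correlation(absences, marks)

print(round(r_age, 3), round(r_absences, 3))
```

Building the five sums explicitly mirrors the columns of the hand-calculation tables, which makes it easy to compare intermediate totals when checking work.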

12.2.3 Hypothesis Testing of Correlation Coefficient


The sample correlation coefficient r indicates the relationship between the variables for a sample, but if
we want to generalize this to the population, we have to test a hypothesis. To perform a test of
hypotheses about the population correlation coefficient ρ, we use the t-distribution.

Hypotheses

H₀: ρ = 0 (There is no correlation between the variables)
H₁: ρ ≠ 0 (There is correlation between the variables)

Test Statistic
If both variables are normally distributed, then the value of the test statistic for testing H₀: ρ = 0 is
calculated by:

$$t = r\sqrt{\frac{n-2}{1-r^2}}.$$

It has a t-distribution with degrees of freedom d.f. = n − 2.

EXAMPLE 12−5

Test the significance of the correlation coefficient for the age and blood pressure data.

SOLUTION

In Example 12−3, we obtained r = 0.897. This shows there is a strong positive linear correlation between
age and blood pressure in the sample data. To conclude the same for the population we have to carry out
hypothesis testing.

Hypotheses

H₀: ρ = 0 (There is no correlation between the variables)
H₁: ρ ≠ 0 (There is correlation between the variables)

Critical value
Since the value of α is not given, we use α = 0.05 and d.f. = 6 − 2 = 4. Looking at the t-distribution table
from the Eton tables with ν = 4 and 2p = 0.05 (two-tailed test), we have the critical values t = ±2.776.

Test Statistic:

$$t = 0.897\sqrt{\frac{6-2}{1-0.897^2}} = 4.059.$$

Conclusion: Since the test value t = 4.059 is in the critical region, H₀ is rejected at the 5% level of
significance. Hence, there is significant correlation between age and blood pressure.

EXAMPLE 12−6

Test the significance of the correlation coefficient for the number of absences and final exam mark data,
using 𝛼 = 0.01.

SOLUTION

In Example 12−4, we obtained r = −0.944. This shows there is a strong negative linear correlation between the
variables in the sample data. To conclude the same for the population, we have to carry out hypothesis
testing.

Hypotheses

H₀: ρ = 0 (There is no correlation between the variables)
H₁: ρ ≠ 0 (There is correlation between the variables)

Critical value:
Here α = 0.01 and d.f. = 7 − 2 = 5. Looking at the t-distribution table from the Eton tables with ν = 5
and 2p = 0.01 (two-tailed test), we have the critical values t = ±4.032.

Test Statistic:

$$t = -0.944\sqrt{\frac{7-2}{1-(-0.944)^2}} = -6.398.$$

Conclusion: Since the test value t = −6.398 is in the critical region, H₀ is rejected at the 1% level of
significance. Hence, there is significant correlation between the variables.
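
The test statistic for the correlation coefficient is easy to evaluate in code. This is a minimal sketch, assuming the rounded values of r from the examples (0.897 with n = 6, and −0.944 with n = 7); the function name correlation_t is hypothetical, and the critical values are copied from the t-table.

```python
from math import sqrt


def correlation_t(r, n):
    """t statistic for testing H0: rho = 0, with n - 2 degrees of freedom."""
    return r * sqrt((n - 2) / (1 - r**2))


# Example 12-5: age and blood pressure, alpha = 0.05, d.f. = 4.
t_age = correlation_t(0.897, 6)
significant_age = abs(t_age) > 2.776

# Example 12-6: absences and final exam mark, alpha = 0.01, d.f. = 5.
t_absences = correlation_t(-0.944, 7)
significant_absences = abs(t_absences) > 4.032

print(round(t_age, 3), significant_age)
print(round(t_absences, 3), significant_absences)
```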



EXAMPLE 12−7

A manager wishes to find out whether there is a relationship between the age of employees and the
number of sick days they take each year. The manager randomly selects a sample of 6 of his employees,
and the data are as follows:

Age, x     18   26   39   48   53   58
Days, y    16   12   9    5    6    2

Test whether the correlation between the age of employees and the number of sick days is
significant at the 5% level of significance.

SOLUTION

We have

n = 6, Σx = 242, Σy = 50, Σx² = 10998, Σy² = 546, and Σxy = 1625.

The sample correlation coefficient is:

$$r = \frac{6(1625) - (242)(50)}{\sqrt{\left[6(10998) - (242)^2\right]\left[6(546) - (50)^2\right]}} = -0.979.$$

Hypotheses

H₀: ρ = 0 (There is no correlation between the variables)
H₁: ρ ≠ 0 (There is correlation between the variables)

Critical value:

Since α = 0.05 and d.f. = 4, the critical values are t = ±2.776.

Test Statistic:

$$t = -0.979\sqrt{\frac{6-2}{1-(-0.979)^2}} = -9.604.$$

Conclusion: Since the test value t = -9.604 is in the critical region, H 0 is rejected at 5% level of
significance. There is a significant relationship between a person’s age and the number of sick days that
a person takes each year.



12.3 Simple Linear Regression
To study the relationship between two variables, we collect data and then construct a scatter plot. The
purpose of the scatter plot is to determine the nature of the relationship. The possibilities include a positive
linear relationship, a negative linear relationship, or no relationship. After the scatter plot is drawn, we
compute the value of the correlation coefficient and then test the significance of the correlation. If there is
significant correlation between the variables, the next step is to determine the equation of the regression line
(also called the line of best fit). If there is no significant correlation between the variables, then proceeding
to regression is meaningless.

Equation of the Regression Line


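The equation of the regression line is written as y' = a + bx, where

$$a = \frac{\left(\sum y\right)\left(\sum x^2\right) - \left(\sum x\right)\left(\sum xy\right)}{n\left(\sum x^2\right) - \left(\sum x\right)^2}, \qquad b = \frac{n\left(\sum xy\right) - \left(\sum x\right)\left(\sum y\right)}{n\left(\sum x^2\right) - \left(\sum x\right)^2}.$$

Here,
a is called the intercept and
b is the slope of the regression line.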

EXAMPLE 12−8

Find the equation of the regression line for the data in Example 12−1. Use the regression line to predict
the blood pressure of a person who is 50 years old.

SOLUTION

We have n = 6, Σx = 345, Σy = 819, Σx² = 20399, and Σxy = 47634. Therefore,

$$a = \frac{(819)(20399) - (345)(47634)}{6(20399) - (345)^2} = 81.048 \quad \text{and} \quad b = \frac{6(47634) - (345)(819)}{6(20399) - (345)^2} = 0.964.$$

Hence the equation of the regression line is: y' = 81.048 + 0.964x. The predicted blood pressure of a person who
is 50 years old is: y' = 81.048 + 0.964(50) ≈ 129.

EXAMPLE 12−9

For the data in Example 12-7, find the equation of the regression line. Also, predict y when the age (x)
of an employee is 47 years.

SOLUTION

We have n = 6, Σx = 242, Σy = 50, Σx² = 10998, and Σxy = 1625.

Therefore,

$$a = \frac{(50)(10998) - (242)(1625)}{6(10998) - (242)^2} = 21.100 \quad \text{and} \quad b = \frac{6(1625) - (242)(50)}{6(10998) - (242)^2} = -0.317.$$

Hence the equation of the regression line is: y' = 21.100 − 0.317x. The predicted number of sick days for an
employee who is 47 years old is y' = 21.100 − 0.317(47) ≈ 6.2 ≈ 6 days.
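
The intercept and slope formulas can be checked with a short Python sketch. It is an illustration rather than part of the original solutions: the function name regression_line is our own, and because the code keeps full precision, its results may differ in the last decimal place from hand calculations that use rounded intermediate values.

```python
def regression_line(xs, ys):
    """Intercept a and slope b of the least-squares line y' = a + b x."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    denominator = n * sum_x2 - sum_x**2
    a = (sum_y * sum_x2 - sum_x * sum_xy) / denominator
    b = (n * sum_xy - sum_x * sum_y) / denominator
    return a, b


# Example 12-8: age (x) and blood pressure (y); predict for age 50.
a_bp, b_bp = regression_line([43, 48, 56, 61, 67, 70],
                             [128, 120, 135, 143, 141, 152])
pressure_at_50 = a_bp + b_bp * 50

# Example 12-9: employee age (x) and sick days (y); predict for age 47.
a_sick, b_sick = regression_line([18, 26, 39, 48, 53, 58],
                                 [16, 12, 9, 5, 6, 2])
days_at_47 = a_sick + b_sick * 47

print(round(a_bp, 3), round(b_bp, 3), round(pressure_at_50))
print(round(a_sick, 3), round(b_sick, 3), round(days_at_47, 2))
```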

Coefficient of Determination
We now know how to construct a linear regression model, but:
• How good is the regression model?
• How well does the independent variable explain the dependent variable in the regression model?

The coefficient of determination is one concept that answers these questions. The square of the correlation
coefficient is known as the coefficient of determination, that is:

Coefficient of determination = r², where 0 ≤ r² ≤ 1.

It gives the proportion of the total variation in y that is explained (accounted for) by the regression model. If
r² is very close to 1, then the model predicts y very well.

EXAMPLE 12−10

The following data represent trends in cigarette consumption (x) per capita and lung cancer
mortality rate (y) in a county.
Consumption (x) 11.8 12.5 15.7 19.2 21.9 23.3
Mortality rate (y) 10.4 16.5 22.9 26.6 33.8 42.8

A. Calculate the coefficient of correlation between x and y.
B. Test whether the coefficient of correlation obtained in (A) is significant at the 5% level of significance.
C. Find the equation of the regression line for predicting mortality rate.
D. Estimate the mortality rate when cigarette consumption is 18.5.
E. Calculate and interpret the coefficient of determination.

SOLUTION

x       y       xy        x²        y²
11.8    10.4    122.72    139.24    108.16
12.5    16.5    206.25    156.25    272.25
15.7    22.9    359.53    246.49    524.41
19.2    26.6    510.72    368.64    707.56
21.9    33.8    740.22    479.61    1142.44
23.3    42.8    997.24    542.89    1831.84
Σx = 104.4   Σy = 153   Σxy = 2936.68   Σx² = 1933.12   Σy² = 4586.66

A. Here, n = 6. The coefficient of correlation (r) is:

$$r = \frac{6(2936.68) - (104.4)(153)}{\sqrt{\left[6(1933.12) - (104.4)^2\right]\left[6(4586.66) - (153)^2\right]}} = 0.971.$$

B. Test the sample correlation coefficient:

Hypotheses

H₀: ρ = 0 (There is no correlation between the variables)
H₁: ρ ≠ 0 (There is correlation between the variables)

Critical value:
Since α = 0.05 and d.f. = 4, the critical values are t = ±2.776.

Test Statistic:

$$t = 0.971\sqrt{\frac{6-2}{1-(0.971)^2}} = 8.12.$$

Conclusion: Since the test value lies in the critical region, H0 is rejected at 5% level of significance.
Hence, we may conclude that the correlation between the cigarette consumption per capita and lung
cancer mortality rate is significant.

C. We have

$$a = \frac{(153)(1933.12) - (104.4)(2936.68)}{6(1933.12) - (104.4)^2} = -15.4742 \quad \text{and} \quad b = \frac{6(2936.68) - (104.4)(153)}{6(1933.12) - (104.4)^2} = 2.3548.$$

Hence the equation of the regression line is: y' = −15.4742 + 2.3548x.

D. When the cigarette consumption is 18.5, the predicted mortality rate is y' = −15.4742 + 2.3548(18.5) = 28.09.

E. Coefficient of determination = r² = (0.971)² = 0.943. This means that 94.3% of the total variation
in the mortality rate is explained by the linear regression model.
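
The interpretation of r² as the proportion of variation explained can be illustrated numerically. The sketch below is a hedged example using the data of Example 12−10: it fits the regression line and then computes 1 − SSE/SST, where SSE (sum of squared residuals) and SST (total sum of squares about the mean of y) are names introduced here for illustration and do not appear in the text.

```python
xs = [11.8, 12.5, 15.7, 19.2, 21.9, 23.3]   # cigarette consumption
ys = [10.4, 16.5, 22.9, 26.6, 33.8, 42.8]   # lung cancer mortality rate

n = len(xs)
sum_x, sum_y = sum(xs), sum(ys)
sum_xy = sum(x * y for x, y in zip(xs, ys))
sum_x2 = sum(x * x for x in xs)

# Least-squares intercept and slope.
denominator = n * sum_x2 - sum_x**2
a = (sum_y * sum_x2 - sum_x * sum_xy) / denominator
b = (n * sum_xy - sum_x * sum_y) / denominator

# Total variation of y and the part left unexplained by the line.
y_mean = sum_y / n
sst = sum((y - y_mean) ** 2 for y in ys)
sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
r_squared = 1 - sse / sst

print(round(a, 4), round(b, 4), round(r_squared, 3))
```

For simple linear regression this ratio equals the square of the correlation coefficient, which is why r² can be read directly as the share of variation in y accounted for by x.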



12.4 Multiple Linear Regression
The previous section explained the concepts of correlation and simple linear regression. In simple linear
regression, the regression equation has one independent variable x and one dependent variable y' and is
written as y' = a + bx, where a is called the intercept and b is the slope of the regression line.

In multiple linear regression there are k independent variables x₁, x₂, …, x_k and one dependent variable
y', and the regression equation is given by:

$$y' = a + b_1 x_1 + b_2 x_2 + \cdots + b_k x_k.$$

A multiple correlation coefficient R can also be computed to determine if a significant relationship exists
between the independent variables and the dependent variable. Since the computations in multiple
regression are quite complicated, they are for the most part done on a computer. We will only
consider examples with two independent variables and one dependent variable.

EXAMPLE 12−11

A lecturer at USP wishes to see whether a student’s grade point average and age are related to the
student’s score in the final exam. He selects five students and obtains the following data.

Student    GPA, x₁    Age, x₂    Final Exam Score, y
A          3.2        22         80
B          2.7        27         86
C          2.5        24         75
D          3.4        28         98
E          2.2        23         64

We will use Excel for this problem, please follow the steps below:
1. Enter the data in three separate columns of a new worksheet.
2. Select Data tab on the tool bar, then Data Analysis >Regression.

Using Excel, we obtain the following output:


SUMMARY OUTPUT

Regression Statistics
Multiple R           0.984382
R Square             0.969007
Adjusted R Square    0.938014
Standard Error       3.40005
Observations         5

ANOVA
              df    SS          MS          F           Significance F
Regression    2     722.8793    361.4397    31.26548    0.030993
Residual      2     23.12069    11.56034
Total         4     746

                Coefficients    Standard Error    t Stat      P-value     Lower 95%    Upper 95%    Lower 95.0%    Upper 95.0%
Intercept       −39.8114        16.80644          −2.36882    0.141377    −112.124     32.50084     −112.124       32.50084
X Variable 1    18.18575        3.698114          4.917574    0.038952    2.27405      34.09745     2.27405        34.09745
X Variable 2    2.777876        0.707173          3.928139    0.059119    −0.26485     5.820598     −0.26485       5.820598

From the output, we obtain the following:

1. The multiple correlation coefficient R = 0.984382, which indicates that there is a strong
relationship between students’ GPA and age and the final exam score.

Note: The multiple correlation coefficient R can range from 0 to +1; it can never be negative. If
it is closer to +1, the relationship is strong, and if closer to 0, the relationship is weak.

2. R² = 0.969007 is the coefficient of multiple determination; it is the proportion of the variation in y
explained by the regression model.

3. To test the significance of the relationship, we can use the P-value given in the output (Significance F),
which is 0.030993. Since the P-value is less than α = 0.05, we reject the null hypothesis and
conclude that there is a significant relationship between students’ GPA and age and the final
exam score.

4. The multiple regression equation obtained is: y' = −39.8114 + 18.18575x₁ + 2.777876x₂.

5. If a student has a GPA of 3.0 and is 25 years old, her predicted final exam score is
y' = −39.8114 + 18.18575(3.0) + 2.777876(25) ≈ 84.

EXAMPLE 12−12

A study was conducted, and a significant relationship was found among the number of hours a teenager
watches television per day x₁, the number of hours a teenager talks on the telephone per day x₂, and the
teenager’s weight y. The regression equation is y' = 98.7 + 3.82x₁ + 6.51x₂. Predict a teenager’s
weight if she averages 3 hours of TV and 1.5 hours on the phone a day.

SOLUTION

Using the regression equation, we have y' = 98.7 + 3.82(3) + 6.51(1.5) = 119.93. The teenager’s predicted
weight is 119.93 kg if she averages 3 hours of TV and 1.5 hours on the phone per day.
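
Once a multiple regression equation is known, prediction is a weighted sum. The following Python sketch is illustrative only: the helper name predict is hypothetical, and the coefficients are copied from the regression equations quoted in Examples 12−11 and 12−12 rather than re-estimated from data.

```python
def predict(intercept, coefficients, values):
    """Evaluate y' = a + b1*x1 + b2*x2 + ... + bk*xk."""
    return intercept + sum(b * x for b, x in zip(coefficients, values))


# Example 12-11: GPA of 3.0 and age of 25 years.
exam_score = predict(-39.8114, [18.18575, 2.777876], [3.0, 25])

# Example 12-12: 3 hours of TV and 1.5 hours on the phone per day.
weight = predict(98.7, [3.82, 6.51], [3, 1.5])

print(round(exam_score), weight)
```

Passing the coefficients and values as lists keeps the helper usable for any number k of independent variables.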



12.5 Summary
This chapter explains the concepts of correlation and regression. The concepts discussed in this chapter
are scatter plots, correlation coefficient, testing the significance of correlation, regression line and
coefficient of determination. The chapter also discusses the concept of multiple linear regression.

EXERCISES

1. Explain the similarities and differences between simple linear regression and multiple regression.

2. Recent agricultural data in Fiji showed the number of eggs produced and the price received per
dozen for a given year.

No. of eggs (millions), x       957      1332     1163     1865     119      273
Price per dozen (dollars), y    0.770    0.697    0.617    0.652    1.080    1.420

The summary data is given as follows:

Σx = 5709, Σy = 5.236, Σx² = 7609557, Σy² = 5.067302, Σxy = 4115.025

A. Calculate the sample coefficient of correlation between x and y.


B. Test whether the coefficient of correlation obtained in part A is significant at 5% level of
significance.

If the coefficient of correlation is significant in part B, find the following:


C. The equation of the regression line.
D. Calculate and interpret the coefficient of determination.
E. Predict y ' when x  1600 million eggs.

3. A researcher has determined that a significant relationship exists among an employee’s age x1 ,
grade point average x2 , and income y . The multiple regression equation is
y '  34127  132 x1  20805x2 . Predict the income of a person who is 32 years old and has a
GPA of 3.4.

4. The data shown below is for the car rental companies in Fiji for a recent year.

Company                    A      B      C       D       E       F
Cars (in thousands), x     63     29     20.8    19.1    13.4    8.5
Revenue (in millions), y   7.0    3.9    2.1     2.8     1.4     1.5

Using the 5% level of significance and r = 0.982, test whether the coefficient of correlation is
significant.



CHAPTER 13:

THE CHI-SQUARE TESTS

Chapter 13: The Chi-Square Tests 182


Overview
This chapter focuses on the chi-square tests to analyse categorical data. The chi-square tests discussed
are: (1) test for goodness of fit; (2) test for independence of variables. The chapter concludes with a
summary and a set of exercises.

Objectives
After completing this chapter, you should be able to:
1. Test the distribution for goodness of fit, using chi-square.
2. Test two variables for independence, using chi-square.

13.1 Introduction
This chapter describes hypothesis testing of categorical data based on the chi-square distribution. The
distribution can be used for tests concerning frequency distributions, such as whether the observed
frequencies of an experiment follow a certain pattern or theoretical distribution. This test is called the
chi-square test for goodness-of-fit. The chi-square distribution can also be used to test the independence of
two attributes. For example, we can test whether the two attributes ‘smoking’ and ‘cancer’ are independent.

13.2 The Chi-square Distribution


To test for goodness of fit and to test the independence of two attributes, a new statistical distribution is
needed. It is called the chi-square distribution (the symbol for chi-square is χ²), which, like the
t-distribution, is a family of distributions based on the degrees of freedom.

Unlike the t-distribution, which is symmetric about the mean 0 for any degrees of freedom, the chi-square
random variable χ² takes nonnegative values only, and its distribution is always skewed to the right. The
general shape of chi-square distributions is shown below. It can be seen that the skewness diminishes
as the degrees of freedom (ν) increase.

The value of χ² which leaves an area α (with ν d.f.) to its right is represented by χ²_α.

If we know the values of the degrees of freedom (d.f.) and α, the area in the right tail, we can find the value
of χ² from the Eton Table as illustrated in the following example.

EXAMPLE 13−1

Find the value of $\chi^2$ for 5 d.f. and an area of 0.025 in the right tail of the chi-square distribution.

SOLUTION

To find the value of $\chi^2$, look for $\nu = 5$ and $\alpha = 0.025$ in the Table. Therefore, for d.f. = 5,
the value of $\chi^2 = 12.833$.
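The table lookup can also be checked numerically. The following is a minimal sketch using only the Python standard library; the helper names `chi2_cdf` and `chi2_critical` are illustrative assumptions, not part of any statistical package. It evaluates the chi-square left-tail area with a series for the incomplete gamma function and then searches for the value that leaves the required area in the right tail.

```python
import math

def chi2_cdf(x, df):
    """Left-tail area P(X <= x) for a chi-square variable with df degrees of freedom."""
    if x <= 0:
        return 0.0
    a, z = df / 2.0, x / 2.0
    # Series for the regularized lower incomplete gamma function P(a, z).
    term = 1.0 / a
    total = term
    n = 1
    while abs(term) > 1e-15 * abs(total):
        term *= z / (a + n)
        total += term
        n += 1
    return total * math.exp(a * math.log(z) - z - math.lgamma(a))

def chi2_critical(alpha, df):
    """Value that leaves an area alpha in the right tail, found by bisection."""
    lo, hi = 0.0, max(1.0, float(df))
    while 1.0 - chi2_cdf(hi, df) > alpha:   # widen the bracket until it contains the answer
        hi *= 2.0
    for _ in range(100):
        mid = (lo + hi) / 2.0
        if 1.0 - chi2_cdf(mid, df) > alpha:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

# Example 13-1: 5 d.f. and an area of 0.025 in the right tail.
print(round(chi2_critical(0.025, 5), 3))
```

The result should agree with the table value 12.833 to within rounding.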

Applications of chi-square tests


Some of the applications of chi-square test are as follows:
i. Chi-square test for ‘goodness of fit’.
ii. Chi-square test for independence of two attributes.

13.3 Test for Goodness of fit


The chi-square goodness-of-fit test is used when we wish to see whether a frequency distribution fits a
specific pattern or a theoretical distribution. For example, one may wish to see whether accidents occur
more often on some days than on others. In such a case, the frequencies obtained from the actual
performance of an experiment are called the observed frequencies. The test is called goodness-of-fit
because the hypothesis tested is how well the observed frequencies fit a given pattern. The test is
performed by calculating the expected frequencies for the pattern stated in H0.

Observed and Expected Frequencies

• The frequencies obtained from the performance of an experiment are called observed
frequencies and are denoted by O.
• The expected frequencies, denoted by E, are the frequencies that we expect to obtain if the
null hypothesis H0 is true.

In a goodness-of-fit test, we test the null hypothesis H0 that the observed frequencies for an experiment
follow a certain pattern or theoretical distribution. The expected frequency for a category is obtained as

$$E = np,$$

where n is the sample size and p is the probability that an observation falls in that category if H0 is true.

Degrees of Freedom
The degrees of freedom equal the number of categories minus 1, that is, d.f. = k – 1, where k denotes the
number of possible outcomes (or categories) for the experiment.

Test statistic

$$\chi^2 = \sum \frac{(O - E)^2}{E}$$

Where,
O = observed frequency for a category
E = expected frequency for a category = np

Note:
1. A chi–square goodness–of–fit test is always a right–tailed test.
2. If the expected frequency of a class is too small (<5), combine that class with an adjacent class.

Assumption for a goodness–of–fit test


1. Samples must be randomly selected.
2. Sample should be large enough so that the expected frequency for each category is at least 5.

EXAMPLE 13−2

The number of automobile accidents per week in a city over a 10-week period is as follows: 12, 8, 20, 2,
14, 10, 15, 6, 9, 4. Are these frequencies in agreement with the belief that accident conditions were the
same during this 10-week period? Use the 5% level of significance.

SOLUTION

Hypothesis:

H0 : The accident conditions per week are the same.
H1 : The accident conditions per week are not the same.

Critical Value:
From the Eton Table, the critical value using d.f. = 9 and $\alpha = 0.05$ is $\chi^2_{0.05} = 16.919$.

Test value:

Computation of Expected Frequency and Test Statistic Value:

If H0 is true, the expected number of accidents is the same in every week:

$$E = \frac{\text{Total accidents}}{\text{No. of weeks}} = \frac{100}{10} = 10$$

Observed frequency (O)   Expected frequency (E)   (O - E)^2   (O - E)^2 / E
12                       10                       4           0.4
8                        10                       4           0.4
20                       10                       100         10.0
2                        10                       64          6.4
14                       10                       16          1.6
10                       10                       0           0.0
15                       10                       25          2.5
6                        10                       16          1.6
9                        10                       1           0.1
4                        10                       36          3.6
Total                                                         26.6

Therefore, the test value is $\chi^2 = \sum \frac{(O - E)^2}{E} = 26.6$.

Conclusion:
Since the test value (26.6) exceeds the critical value (16.919), it lies in the critical region, and H0 is
rejected at the 5% level of significance. Hence, we may conclude that the accident conditions per week are
not the same.
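The arithmetic in Example 13−2 can be reproduced with a short script. This is a minimal sketch; the function name `chi_square_gof` is an illustrative assumption, and the critical value is simply copied from the example rather than computed.

```python
def chi_square_gof(observed, expected):
    """Goodness-of-fit statistic: the sum of (O - E)^2 / E over all categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

accidents = [12, 8, 20, 2, 14, 10, 15, 6, 9, 4]

# Under H0 every week has the same expected count: total accidents / number of weeks.
expected = [sum(accidents) / len(accidents)] * len(accidents)

test_value = chi_square_gof(accidents, expected)
critical_value = 16.919          # table value for d.f. = 9 and alpha = 0.05
reject_h0 = test_value > critical_value

print(round(test_value, 1), reject_h0)   # 26.6 True
```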

EXAMPLE 13−3

The theory predicts the proportion of beans in the four groups A, B, C and D should be 9:3:3:1. In an
experiment with 1600 beans the numbers in the four groups were 882, 313, 287 and 118. Does the
experimental result support the theory? Use 5% level of significance.

SOLUTION

Hypothesis:

H 0 : There is no difference between experimental and theoretical results (i.e. experimental result
supports the theory that the proportions of four types of bean are 9:3:3:1).
H1 : The experimental result does not support the theory.

Critical Value:

The critical value using d.f. = 3 and $\alpha = 0.05$ is $\chi^2_{0.05} = 7.815$.

Test value:

Compute the expected frequencies and the test statistic value:

If H0 is true, the expected proportions of the four types of bean are 9:3:3:1. Therefore, the expected
frequencies are obtained by dividing 1600 in the ratio 9:3:3:1.

Observed frequency (O)   Expected frequency (E)    (O - E)^2   (O - E)^2 / E
882                      (9/16) × 1600 = 900       324         0.360
313                      (3/16) × 1600 = 300       169         0.563
287                      (3/16) × 1600 = 300       169         0.563
118                      (1/16) × 1600 = 100       324         3.240
Total                                                          4.726

Therefore, the test value is $\chi^2 = \sum \frac{(O - E)^2}{E} = 4.726$.

Conclusion:
Since the test value (4.726) is less than the critical value (7.815), it falls in the acceptance region, and
we do not reject H0. Hence, we may conclude that the experimental results support the theory.
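When the null hypothesis is stated as a ratio, the expected frequencies follow by splitting the sample size in that ratio. The sketch below illustrates this for Example 13−3; the helper name `expected_from_ratio` is an assumption made for illustration, and the critical value is taken from the example.

```python
def expected_from_ratio(total, ratio):
    """Split a total count into expected frequencies according to a ratio such as 9:3:3:1."""
    parts = sum(ratio)
    return [total * r / parts for r in ratio]

observed = [882, 313, 287, 118]
expected = expected_from_ratio(1600, [9, 3, 3, 1])   # 900, 300, 300, 100

test_value = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
critical_value = 7.815           # table value for d.f. = 3 and alpha = 0.05
reject_h0 = test_value > critical_value

print(expected)
print(round(test_value, 3), reject_h0)
```

Working with unrounded terms gives a test value of about 4.727; the 4.726 in the table comes from rounding each term to three decimals before adding. The decision not to reject H0 is unaffected.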

13.4 Test for Independence


The chi-square test for independence of attributes is used when we want to test whether two attributes are
independent, or whether there is any association between the attributes.

For example, we use the chi-square test for independence of attributes if we want to test:

i. whether smoking is associated with cancer, or
ii. whether a drug is effective in controlling a disease.



Hypothesis

A test of independence involves a test of the null hypothesis that two attributes of a population are
independent, that is,
H 0 : The attributes are independent (i.e. there is no association or relation between the attributes)
H1 : The attributes are not independent (i.e. there is association or relation between the attributes)

Test Statistic

The value of the test statistic $\chi^2$ for a test of independence is calculated as

$$\chi^2 = \sum \frac{(O - E)^2}{E},$$

Where,
O and E are the observed and expected frequencies, respectively, for a cell.

Degrees of Freedom

In testing independence of two attributes, the information is presented in a contingency table where one
attribute is arranged in rows and another attribute is arranged in columns.

The degrees of freedom are:

$$d.f. = (r - 1)(c - 1),$$

Where,
r and c are the number of rows and the number of columns, respectively, in the given
contingency table.

Expected Frequencies

The expected frequency E for a cell is calculated as:

$$E = \frac{(\text{Row total})(\text{Column total})}{\text{Sample size}}.$$



EXAMPLE 13−4

In an experiment on immunization of cattle from tuberculosis, the following results were obtained:

                   Tuberculosis
Vaccination        Affected    Unaffected
Inoculated         12          28
Not inoculated     13          7

Examine whether the vaccine is effective in controlling the disease at 5% level of significance.

SOLUTION

We are given the following information:

                   Tuberculosis
Vaccination        Affected    Unaffected    Total
Inoculated         12          28            40
Not inoculated     13          7             20
Total              25          35            60

Hypothesis:

H 0 : There is no relation between the vaccination and the tuberculosis (i.e. the vaccine is not effective in
controlling the disease).

H1 : The vaccine is effective in controlling the disease.

Critical Value:

Degrees of freedom: $\nu = (r - 1)(c - 1) = (2 - 1)(2 - 1) = 1$. The critical value for 1 d.f. at the 5% level
of significance is $\chi^2_{0.05} = 3.841$.

Computation of expected frequencies and test statistic value:

The expected frequency is computed by:

$$E = \frac{(\text{Row total})(\text{Column total})}{\text{Sample size}}$$

Observed frequency (O)   Expected frequency (E)     (O - E)^2   (O - E)^2 / E
12                       (40 × 25)/60 = 16.667      21.781      1.307
28                       (40 × 35)/60 = 23.333      21.781      0.933
13                       (20 × 25)/60 = 8.333       21.781      2.614
7                        (20 × 35)/60 = 11.667      21.781      1.867
Total                                                           6.721

Therefore, the test value is $\chi^2 = \sum \frac{(O - E)^2}{E} = 6.721$.

Conclusion: Since the test value (6.721) exceeds the critical value (3.841), it falls in the critical region,
and H0 is rejected at the 5% level of significance. Hence, we may conclude that the vaccine is effective in
controlling the disease.
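The expected frequencies and the test value for a contingency table can be computed directly from the row and column totals. The following is a minimal sketch for Example 13−4; the function names are illustrative assumptions, and the critical value is copied from the example.

```python
def expected_frequencies(table):
    """E = (row total)(column total) / sample size for every cell of a contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]

def chi_square_independence(table):
    """Return the test value and the degrees of freedom (r - 1)(c - 1)."""
    expected = expected_frequencies(table)
    stat = sum((o - e) ** 2 / e
               for obs_row, exp_row in zip(table, expected)
               for o, e in zip(obs_row, exp_row))
    df = (len(table) - 1) * (len(table[0]) - 1)
    return stat, df

# Rows: inoculated, not inoculated. Columns: affected, unaffected.
cattle = [[12, 28], [13, 7]]
stat, df = chi_square_independence(cattle)
critical_value = 3.841           # table value for d.f. = 1 and alpha = 0.05

print([[round(e, 3) for e in row] for row in expected_frequencies(cattle)])
print(round(stat, 2), df, stat > critical_value)
```

Without intermediate rounding the test value is 6.72; the 6.721 in the worked table reflects rounding the expected frequencies to three decimals first.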

EXAMPLE 13−5

To study the effect of soil condition on the growth of a new hybrid plant, saplings were planted on three
types of soil and their subsequent growth was classified into three categories.

            Soil Type
Growth      Clay    Sand    Loam    Total
Poor        16      8       14      38
Average     31      16      21      68
Good        18      36      25      79
Total       65      60      60      185

Test the hypothesis that there is an association between growth of plant and soil type. Use 1% level of
significance.

SOLUTION

Hypothesis:
H0 : There is no association between growth of plant and soil type.
H1 : There is an association between growth of plant and soil type.

Critical Value: Degrees of freedom: d.f. = (3 – 1)(3 – 1) = 4.
The critical value for 4 d.f. at the 1% level of significance is $\chi^2_{0.01} = 13.277$.

Computation of expected frequencies and test statistic value:

Observed frequency (O)   Expected frequency (E)      (O - E)^2   (O - E)^2 / E
16                       (38 × 65)/185 = 13.351      7.017       0.526
8                        (38 × 60)/185 = 12.324      18.697      1.517
14                       (38 × 60)/185 = 12.324      2.809       0.228
31                       (68 × 65)/185 = 23.892      50.524      2.115
16                       (68 × 60)/185 = 22.054      36.651      1.662
21                       (68 × 60)/185 = 22.054      1.111       0.050
18                       (79 × 65)/185 = 27.757      95.199      3.430
36                       (79 × 60)/185 = 25.622      107.703     4.204
25                       (79 × 60)/185 = 25.622      0.387       0.015
Total                                                            13.747

Therefore, the test value is $\chi^2 = \sum \frac{(O - E)^2}{E} = 13.747$.

Conclusion: Since the test value (13.747) exceeds the critical value (13.277), it falls in the critical region,
and H0 is rejected at the 1% level of significance. Hence, we may conclude that there is an association
between growth of plant and soil type.
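For a larger table it can be helpful to see how much each cell contributes to the test value. The sketch below recomputes Example 13−5 cell by cell; the variable names are illustrative assumptions, and the critical value is taken from the example.

```python
growth = ["Poor", "Average", "Good"]
soil = ["Clay", "Sand", "Loam"]
observed = [[16, 8, 14],
            [31, 16, 21],
            [18, 36, 25]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Contribution of each cell: (O - E)^2 / E with E = row total * column total / n.
contributions = {}
for i, g in enumerate(growth):
    for j, s in enumerate(soil):
        e = row_totals[i] * col_totals[j] / n
        contributions[(g, s)] = (observed[i][j] - e) ** 2 / e

test_value = sum(contributions.values())
largest = max(contributions, key=contributions.get)
critical_value = 13.277          # table value for d.f. = 4 and alpha = 0.01

print(round(test_value, 2), test_value > critical_value)
print(largest, round(contributions[largest], 3))
```

The largest contributions come from the "Good" row (sand and clay), which is where the observed counts differ most from what independence would predict.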



13.5 Summary
This chapter focuses on the chi-square tests to analyse categorical data. The chi-square tests discussed
are: (1) test for goodness of fit; (2) test for independence of variables.

EXERCISES

1. A Westpac Bank in Kiribati has an ATM installed inside the bank, and it is available to its customers
only from 8am to 3pm. The manager wanted to investigate whether the number of transactions made is the
same for each of the five days (Monday through Friday). She randomly selected one week and
counted the number of transactions made on each of the 5 days. The information she obtained is in
the table below.

Day                       Monday   Tuesday   Wednesday   Thursday   Friday
Number of transactions    253      197       204         279        267

Using the 2.5% significance level, test the null hypothesis that the number of transactions made on each
of the 5 days is the same. Assume that this week is typical of all weeks with regard to the use of this
ATM.

2. A random sample of 300 adults was selected and they were asked if they favor school teachers
punishing students for violence and lack of discipline. Does the sample provide sufficient information
to conclude that the two attributes, gender and opinions of adults, are dependent? Use a 1%
significance level.

Gender Opinions
In Favor (F) Against (A) No Opinion (N) Total
Men (M) 93 70 12 175
Women (W) 87 32 6 125
Total 180 102 18 300



CHAPTER 14:

ANALYSIS OF VARIANCE

Chapter 14: Analysis of Variance 193


Overview
This chapter explains the concepts of analysis of variance (ANOVA). The concepts discussed in this
chapter are F-distribution, one-way and two-way analysis of variance. The chapter concludes with a
summary and a set of exercises.

Objectives
After completing this chapter, you should be able to:
1. Use the one-way ANOVA technique to determine if there is a significant difference among three
or more means.
2. Use the two-way ANOVA technique to determine if there is a significant difference in the main
effects or interaction.

14.1 Introduction
We studied how to compare two population means in Chapter 9. In this chapter, we develop a
method for comparing more than two population means. This method is called analysis of variance
(ANOVA). For example, a marketing specialist wishes to see whether there is a difference in the average
time a customer has to wait in a checkout line in three large self-service department stores. The specialist
will use the ANOVA technique, which uses an F-test to compare three or more means.

The analysis of variance that is used to compare three or more means is called a one-way analysis of
variance or one-factor design or completely randomized design since it contains only one variable
or one factor. In the previous example, the variable is the three department stores. The ANOVA can be
extended to studies involving two variables; such studies are called two-way analysis of variance.

14.2 The F−distribution


When an F–test is used to compare three or more population means, the technique is called analysis of
variance (ANOVA). Now we look at some of the characteristics of F-distribution.

14.2.1 Characteristics of F-Distribution


1. The values of F cannot be negative.
2. The distribution is positively skewed or skewed to the right.
3. The mean of F−distribution is approximately equal to 1.
4. The F-distribution is a family of curves based on the degrees of freedom numerator (d.f.N)
and degrees of freedom denominator (d.f.D).

F-distribution curves



14.3 One-Way Analysis of Variance
Assumptions for this F -Test
 The populations from which the samples were obtained must be normally distributed.
 The samples must be independent of each other.
 The variances of the populations must be equal.

For a test of the difference among three or more means, the following hypotheses should be used:

$H_0: \mu_1 = \mu_2 = \cdots = \mu_k$ (i.e. all population means are equal)
$H_1$: At least one mean is different from others.

Although means are being compared in this F test, variances are used in the test instead of the means.

With the F test, two different estimates of the population variance are made. The first estimate is called
the between-group variance, and it involves computing the variance by using the means of the groups.
The formula for computing the between-group variance is given by:

$$s_B^2 = \frac{\sum n_i (\bar{X}_i - \bar{X}_{GM})^2}{k - 1},$$

Where,
$n_i$ is the sample size for the ith group.
$\bar{X}_i$ is the sample mean for the ith group.
$\bar{X}_{GM}$ is the grand mean.
$k$ is the number of groups.

The second estimate, the within-group variance, involves computing the variance by using all the
data and is not affected by differences in the means. The formula for computing the within-group variance
is given by:

$$s_W^2 = \frac{\sum (n_i - 1) s_i^2}{N - k},$$

Where,
$s_i^2$ is the sample variance for the ith group.

The test value for this test is computed by:

$$F = \frac{s_B^2}{s_W^2},$$

Where,
• d.f.N. = k – 1, where k is the number of groups.
• d.f.D. = N – k, where N is the sum of the sample sizes of the groups.



EXAMPLE 14−1

A marketing specialist wishes to see whether there is a difference in the average time a customer has to
wait in a checkout line in three large self-service department stores. The times (in minutes) are shown
below. Is there a significant difference in the mean waiting times of customers for each store
using $\alpha = 0.05$?

Store A Store B Store C


3 5 1
2 8 3
5 9 4
6 6 2
3 2 7
1 5 3

SOLUTION

Step 1: State the hypothesis and identify the claim.

$H_0: \mu_1 = \mu_2 = \mu_3$
$H_1$: At least one mean is different from others. (claim)

Step 2: Find the critical value. Since k = 3 and N = 18,

d.f.N. = k – 1 = 2
d.f.D. = N – k = 15

The critical value is 3.6823, obtained from the F-distribution table with $\alpha = 0.05$. The rejection
region lies to the right of 3.6823.


Step 3: Compute the test value.

The sample size, mean and variance of each group:

Store A               Store B               Store C
3                     5                     1
2                     8                     3
5                     9                     4
6                     6                     2
3                     2                     7
1                     5                     3
$n_1 = 6$             $n_2 = 6$             $n_3 = 6$
$\bar{X}_1 = 3.33$    $\bar{X}_2 = 5.83$    $\bar{X}_3 = 3.33$
$s_1^2 = 3.47$        $s_2^2 = 6.17$        $s_3^2 = 4.27$

The grand mean:

$$\bar{X}_{GM} = \frac{\sum X}{N} = \frac{75}{18} = 4.17.$$

Between-group variance:

$$s_B^2 = \frac{\sum n_i (\bar{X}_i - \bar{X}_{GM})^2}{k - 1}
= \frac{6(3.33 - 4.17)^2 + 6(5.83 - 4.17)^2 + 6(3.33 - 4.17)^2}{2} = 12.5$$

Within-group variance:

$$s_W^2 = \frac{\sum (n_i - 1) s_i^2}{N - k}
= \frac{5(3.47) + 5(6.17) + 5(4.27)}{5 + 5 + 5} = 4.6367$$

Therefore,

$$F = \frac{s_B^2}{s_W^2} = \frac{12.5}{4.6367} = 2.7.$$

Step 4: Since the test value F = 2.7 is less than the critical value 3.6823, it lies in the non-rejection
region. We do not reject the null hypothesis and conclude that there is no significant difference in the
mean waiting times of customers for each store.

Step 5: There is not enough evidence to support the claim.
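The between-group and within-group variances can be computed from the raw data with Python's standard `statistics` module. This is a minimal sketch for Example 14−1; the function name `one_way_anova` is an illustrative assumption, and the critical value is copied from Step 2.

```python
from statistics import mean, variance

def one_way_anova(groups):
    """Return (between-group variance, within-group variance, F) for a list of samples."""
    k = len(groups)
    N = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / N
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((len(g) - 1) * variance(g) for g in groups)   # variance() is the sample variance
    s2_between = ss_between / (k - 1)
    s2_within = ss_within / (N - k)
    return s2_between, s2_within, s2_between / s2_within

store_a = [3, 2, 5, 6, 3, 1]
store_b = [5, 8, 9, 6, 2, 5]
store_c = [1, 3, 4, 2, 7, 3]

s2_b, s2_w, f_value = one_way_anova([store_a, store_b, store_c])
critical_value = 3.6823          # F table value for d.f.N. = 2, d.f.D. = 15, alpha = 0.05

print(round(s2_b, 2), round(s2_w, 2), round(f_value, 2), f_value > critical_value)
```

With unrounded group statistics the within-group variance is about 4.63 rather than 4.6367, and F is about 2.70, so the conclusion is the same.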

The numerator of the between-group variance is called the sum of squares between groups, denoted by
$SS_B$, and the numerator of the within-group variance is called the sum of squares within groups, or sum of
squares for the error, denoted by $SS_W$. Therefore,

$$s_B^2 = \frac{SS_B}{k - 1} \quad \text{and} \quad s_W^2 = \frac{SS_W}{N - k}.$$

These two variances are sometimes called mean squares, denoted by $MS_B$ and $MS_W$. Therefore,

$$F = \frac{MS_B}{MS_W}.$$

These terms are used to summarize the analysis of variance in the table given below:

Source           Sum of squares   d.f.     Mean squares              F
Between          SS_B             k - 1    MS_B = SS_B / (k - 1)     F = MS_B / MS_W
Within (error)   SS_W             N - k    MS_W = SS_W / (N - k)
Total

The ANOVA table for Example 14-1 is:

Source Sum of squares d.f Mean squares F


Between 25 2 12.5 2.7
Within (error) 69.5505 15 4.6367
Total 94.5505 17

Note: Most computer programs provide ANOVA summary table as the output.

EXAMPLE 14−2

A researcher wishes to see whether there is any difference in the weight gains of athletes following one
of three special diets. Athletes are randomly assigned to 3 groups and placed on the diet for 6 weeks.
The weight gains (in pounds) are listed in the table in Step 3 of the solution. At $\alpha = 0.05$, can the
researcher conclude that there is a difference in the diets?

SOLUTION

Step 1: State the hypothesis and identify the claim.

$H_0: \mu_1 = \mu_2 = \mu_3$
$H_1$: At least one mean is different from others. (claim)

Step 2: Find the critical value. Since k = 3 and N = 14,

d.f.N. = k – 1 = 2
d.f.D. = N – k = 11

The critical value is 3.9823, obtained from the F-distribution table with $\alpha = 0.05$. The rejection
region lies to the right of 3.9823.

Step 3: Compute the test value.

The sample size, mean and variance of each group:

Diet A               Diet B                 Diet C
3                    10                     8
6                    12                     3
7                    11                     2
4                    14                     5
                     8
                     6
$n_1 = 4$            $n_2 = 6$              $n_3 = 4$
$\bar{X}_1 = 5$      $\bar{X}_2 = 10.17$    $\bar{X}_3 = 4.5$
$s_1^2 = 3.33$       $s_2^2 = 8.17$         $s_3^2 = 7$

The grand mean:

$$\bar{X}_{GM} = \frac{\sum X}{N} = \frac{99}{14} = 7.07.$$

Between-group variance:

$$s_B^2 = \frac{\sum n_i (\bar{X}_i - \bar{X}_{GM})^2}{k - 1}
= \frac{4(5 - 7.07)^2 + 6(10.17 - 7.07)^2 + 4(4.5 - 7.07)^2}{2} = 50.61$$

Within-group variance:

$$s_W^2 = \frac{\sum (n_i - 1) s_i^2}{N - k}
= \frac{3(3.33) + 5(8.17) + 3(7)}{11} = 6.53$$

Therefore,

$$F = \frac{s_B^2}{s_W^2} = \frac{50.61}{6.53} = 7.75.$$

Step 4: Since the test value F = 7.75 exceeds the critical value 3.9823, it lies in the rejection region. We
reject the null hypothesis and conclude that there is a significant difference in the diets.

Step 5: There is enough evidence to support the claim.

Therefore, the ANOVA table for Example 14-2 is:

Source Sum of squares d.f Mean squares F


Between 101.22 2 50.61 7.75
Within (error) 71.84 11 6.53
Total 173.06 13
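An ANOVA summary table such as the one above can be built from the raw data, including when the groups have unequal sizes. The sketch below is one way to do it for Example 14−2; the function name `anova_table` and the dictionary layout are illustrative assumptions. It obtains the within sum of squares as the total sum of squares minus the between sum of squares.

```python
def anova_table(groups):
    """One-way ANOVA summary: source -> (sum of squares, d.f., mean square, F)."""
    values = [x for g in groups for x in g]
    N, k = len(values), len(groups)
    grand_mean = sum(values) / N
    ss_total = sum((x - grand_mean) ** 2 for x in values)
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    ss_within = ss_total - ss_between      # total = between + within
    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (N - k)
    return {
        "Between": (ss_between, k - 1, ms_between, ms_between / ms_within),
        "Within (error)": (ss_within, N - k, ms_within, None),
        "Total": (ss_total, N - 1, None, None),
    }

diets = [[3, 6, 7, 4], [10, 12, 11, 14, 8, 6], [8, 3, 2, 5]]
table = anova_table(diets)

for source, (ss, df, ms, f) in table.items():
    cells = [source, f"{ss:.2f}", str(df)]
    cells.append("" if ms is None else f"{ms:.2f}")
    cells.append("" if f is None else f"{f:.2f}")
    print("  ".join(cells))
```

Because no intermediate rounding is applied, the sums of squares come out near 101.10 and 71.83 and F near 7.74, slightly different from the hand-computed 101.22, 71.84 and 7.75, which use means and variances rounded to two decimals.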

EXAMPLE 14−3

A research organization tested microwave ovens. At $\alpha = 0.05$, is there a significant difference in the
average prices of the three types of oven?

Watts
1000     900      800
270      240      180
245      135      155
190      160      200
215      230      120
250      250      140
230      200      180
         200      140
         210      130

A computer printout for this example is shown below. Use the P-value method and the information in the
printout to test the claim.

Descriptive Statistics
            Mean     n     Std. Dev
Group 1     233.3    6     28.23
Group 2     203.1    8     39.36
Group 3     155.6    8     28.21
Total       194.1    22    44.79

ANOVA table
Source       SS           df    MS            F        p-value
Treatment    21,729.73    2     10,864.867    10.12    .0010
Error        20,402.08    19    1,073.794
Total        42,131.82    21

SOLUTION

Step 1: State the hypothesis

$H_0: \mu_1 = \mu_2 = \mu_3$
$H_1$: At least one mean is different from others.

Step 2: Find the test value. From the ANOVA table, the test value is F = 10.12.

Step 3: Compute the P-value.

The P-value from the ANOVA table is 0.001.

Step 4: Since the P-value (0.001) is less than $\alpha$ (0.05), we reject the null hypothesis and conclude
that there is a significant difference in the average prices of the three types of oven.
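The figures in the printout can be reproduced from the price data. The following sketch uses the standard `statistics` module; the dictionary keys and variable names are illustrative assumptions. The P-value itself is copied from the printout, since computing it requires the F-distribution, which this sketch does not implement.

```python
from statistics import mean, stdev

prices = {
    "1000 W": [270, 245, 190, 215, 250, 230],
    "900 W": [240, 135, 160, 230, 250, 200, 200, 210],
    "800 W": [180, 155, 200, 120, 140, 180, 140, 130],
}

# Descriptive statistics per group: mean, n, standard deviation.
for label, group in prices.items():
    print(label, round(mean(group), 1), len(group), round(stdev(group), 2))

all_prices = [x for group in prices.values() for x in group]
grand_mean = mean(all_prices)
k, N = len(prices), len(all_prices)

ss_treatment = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in prices.values())
ss_error = sum((len(g) - 1) * stdev(g) ** 2 for g in prices.values())
f_value = (ss_treatment / (k - 1)) / (ss_error / (N - k))

alpha, p_value = 0.05, 0.0010    # p-value copied from the printout, not computed here
reject_h0 = p_value < alpha

print(round(ss_treatment, 2), round(ss_error, 2), round(f_value, 2), reject_h0)
```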

EXAMPLE 14−4

A set of data involving 4 different types of food, A, B, C and D, tried on 20 chicks is given below. All 20
chicks are treated alike in all respects except the feeding treatments, and each feeding treatment is given
to 5 randomly selected chicks. Perform an analysis of variance and test the hypothesis that the mean
weight gain is the same for all 4 foods.

The weight gain (in gm) of chicks due to the foods was recorded as:

55 (A)     42 (C)     30 (B)     85 (D)
169 (D)    42 (A)     81 (C)     154 (D)
61 (B)     21 (A)     169 (D)    52 (A)
49 (A)     97 (C)     95 (C)     63 (B)
137 (D)    112 (B)    89 (B)     92 (C)


SOLUTION

Arranging the data and computing the sample size, mean and variance of each group, we have:

Food A                 Food B               Food C                 Food D
55                     61                   42                     169
49                     112                  97                     137
42                     30                   81                     169
21                     89                   95                     85
52                     63                   92                     154
$n_1 = 5$              $n_2 = 5$            $n_3 = 5$              $n_4 = 5$
$\bar{X}_1 = 43.8$     $\bar{X}_2 = 71$     $\bar{X}_3 = 81.4$     $\bar{X}_4 = 142.8$
$s_1^2 = 185.7$        $s_2^2 = 962.5$      $s_3^2 = 523.3$        $s_4^2 = 1218.2$

Step 1: State the hypothesis

$H_0: \mu_1 = \mu_2 = \mu_3 = \mu_4$ (mean weight gains are the same)
$H_1$: At least one mean is different from others.

Step 2: Find the critical value. Since k = 4 and N = 20,

d.f.N. = k – 1 = 3
d.f.D. = N – k = 16

The critical value is 3.2389, obtained from the F-distribution table with $\alpha = 0.05$. The rejection
region lies to the right of 3.2389.

Step 3: Compute the test value.

The sample size, mean and variance of each group are given in the table above.


The grand mean:

$$\bar{X}_{GM} = \frac{\sum X}{N} = \frac{1695}{20} = 84.75.$$

Between-group variance:

$$s_B^2 = \frac{\sum n_i (\bar{X}_i - \bar{X}_{GM})^2}{k - 1}
= \frac{5(43.8 - 84.75)^2 + 5(71 - 84.75)^2 + 5(81.4 - 84.75)^2 + 5(142.8 - 84.75)^2}{3}
= \frac{26234.95}{3} = 8744.98$$

Within-group variance:

$$s_W^2 = \frac{\sum (n_i - 1) s_i^2}{N - k}
= \frac{4(185.7) + 4(962.5) + 4(523.3) + 4(1218.2)}{16}
= \frac{11558.80}{16} = 722.43$$

Therefore,

$$F = \frac{s_B^2}{s_W^2} = \frac{8744.98}{722.43} = 12.105.$$

Step 4: Since the test value F = 12.105 exceeds the critical value 3.2389, it lies in the rejection region.
We reject the null hypothesis and conclude that the mean weight gain is not the same for all 4 foods.

Note:
When the null hypothesis is rejected using the F-test, we conclude that the means are not all equal, but we
still do not know where the differences exist. Several procedures have been developed to determine where
the significant differences in the means lie after the ANOVA has been performed. Among the most
commonly used tests are the Scheffé test and the Tukey test.


14.4 Two-Way Analysis of Variance
The analysis of variance technique shown previously is called a one-way analysis of variance since there
is only one independent variable. The two-way ANOVA is an extension to the one-way ANOVA; it involves
two independent variables. The independent variables are also called factors.

The two-way analysis of variance is quite complicated, and many aspects of the subject must be
considered when it is used. For this reason, only a brief introduction to the subject is given in this
chapter.

In a two-way ANOVA, the researcher is able to test the effects of two independent variables or factors on
one dependent variable. In addition, the interaction effect of the two variables can also be studied.

For example, suppose a researcher wishes to test the effect of two varieties (say variety A and B) of
potatoes and two different locations (say location 1 and 2) on the yielding capacity of potatoes. The two
factors or independent variables are the varieties of potatoes and the different locations, while the
dependent variable is the yield of potatoes. The factors such as water, temperature, and sunlight are held
constant.

To conduct the experiment, the researcher sets up the following groups:

Group 1: Potato Variety A, location 1


Group 2: Potato Variety A, location 2
Group 3: Potato Variety B, location 1
Group 4: Potato Variety B, location 2

The two-way ANOVA has several hypotheses. For the above example, the hypotheses are as follows:

Variety of Potatoes:
H 0 : There is no significant difference in the yielding capabilities of the two varieties.
H1 : There is a significant difference in the yielding capabilities of the two varieties.

Different Locations:
H 0 : There are no significant differences between the locations
H1 : There are significant differences between the locations

Interaction Effect:
H 0 : There is no interaction effect between the variety of potato and different location on the yield.
H1 : There is interaction effect between the variety of potato and different location on the yield.

Note:
1. The groups for such a two-way ANOVA are sometimes called treatment groups.
2. This design is called a 2 × 2 design, since each variable consists of two levels, that is, two
different treatments.


In general, the two-way ANOVA summary table is as shown below:

Source           Sum of squares   d.f.             Mean squares                             F
A                SS_A             a - 1            MS_A = SS_A / (a - 1)                    F_A = MS_A / MS_W
B                SS_B             b - 1            MS_B = SS_B / (b - 1)                    F_B = MS_B / MS_W
A × B            SS_{A×B}         (a - 1)(b - 1)   MS_{A×B} = SS_{A×B} / [(a - 1)(b - 1)]   F_{A×B} = MS_{A×B} / MS_W
Within (error)   SS_W             ab(n - 1)        MS_W = SS_W / [ab(n - 1)]
Total

In the table:

SS_A = sum of squares for factor A
SS_B = sum of squares for factor B
SS_{A×B} = sum of squares for interaction
SS_W = sum of squares for error (or within)
a = number of levels of factor A
b = number of levels of factor B
n = number of subjects in each group

The computational procedure for the two-way ANOVA is quite lengthy. For this reason, the sum of
squares will be provided in a summary ANOVA table and you should be able to interpret the table and
summarize the results.

EXAMPLE 14−5

A researcher wishes to see whether the type of gasoline used and the type of automobile driven have
any effect on gasoline consumption. Two types of gasoline (regular and high-octane) and two types
of automobiles (2-wheel drive and 4-wheel drive) will be used. There will be two automobiles in each
group, so 8 automobiles are used in total. Analyse the data shown below, using two-way
ANOVA with $\alpha = 0.05$.

The data (in miles per gallon) and the summary table are shown here.

               Type of Automobile
Gas            2-wheel drive    4-wheel drive
Regular        26.7, 25.2       28.6, 29.3
High-octane    32.3, 32.8       26.1, 24.2


Source SS d.f MS F
Gasoline 3.920
Automobile 9.680
Interaction 54.080
Within (error) 3.300
Total 70.980

SOLUTION

Step 1: State the hypothesis.

Hypothesis for Gasoline:


H 0 : There is no difference between the means of gasoline consumption for two types of gasoline.
H1 : There is difference between the means of gasoline consumption for two types of gasoline.

Hypothesis for automobile:

H 0 : There is no difference between the means of gasoline consumption for two types of automobiles.
H1 : There is difference between the means of gasoline consumption for two types of automobiles.

Hypothesis for interaction Effect:

H 0 : There is no interaction effect between type of gasoline used and type of automobile a person drives
on gasoline consumption.
H1 : There is interaction effect between type of gasoline used and type of automobile a person drives
on gasoline consumption.

Step 2: Find the critical values for each F-test. Factor A is the type of gasoline and it has two levels
(regular and high-octane), so a = 2. Factor B is the type of automobile driven and it has two levels
(2-wheel and 4-wheel drive), so b = 2. The number of data values in each group is 2, so n = 2. The
degrees of freedom are as follows:

Gasoline: a – 1 = 2 – 1 = 1.
Automobile: b – 1 = 2 – 1 = 1.
Interaction: (a – 1)(b – 1) = (2 – 1)(2 – 1) = 1.
Error: ab(n – 1) = 2(2)(2 – 1) = 4.

Therefore, the critical values are as follows:

Gasoline: Using $\alpha = 0.05$, d.f.N = 1, and d.f.D = 4, we get 7.71.
Automobile: Using $\alpha = 0.05$, d.f.N = 1, and d.f.D = 4, we get 7.71.
Interaction: Using $\alpha = 0.05$, d.f.N = 1, and d.f.D = 4, we get 7.71.


Step 3: Complete the ANOVA table and compute the test values.

Therefore, the complete ANOVA table is:

Source SS d.f MS F
Gasoline 3.920 1 3.920 4.752
Automobile 9.680 1 9.680 11.733
Interaction 54.080 1 54.080 65.552
Within (error) 3.300 4 0.825
Total 70.980 7

The test values are as follows:

Gasoline: F = 4.752
Automobile: F = 11.733
Interaction: F = 65.552

Step 4: Reject or do not reject the null hypothesis and state the conclusion.

Gasoline:
Since the test value F = 4.752 is less than the critical value 7.71, it falls in the acceptance region.
Therefore, we do not reject the null hypothesis and conclude that there is no difference between the means
of gasoline consumption for the two types of gasoline.

Automobile:
Since the test value F = 11.733 exceeds the critical value 7.71, it falls in the rejection region.
Therefore, we reject the null hypothesis and conclude that there is a difference between the means of
gasoline consumption for the two types of automobiles.

Interaction:
Since the test value F = 65.552 exceeds the critical value 7.71, it falls in the rejection region.
Therefore, we reject the null hypothesis and conclude that there is an interaction effect between the type
of gasoline used and the type of automobile a person drives on gasoline consumption.
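Although the sums of squares are supplied in the summary table, they can be recovered from the raw data. The sketch below computes them for Example 14−5, assuming a balanced design (the same number of observations in every cell); the function name `two_way_anova_ss` and the nested-list layout are illustrative assumptions.

```python
def two_way_anova_ss(cells):
    """Sums of squares for a balanced two-way design.

    cells[i][j] is the list of observations for level i of factor A and level j of factor B.
    Returns (SS_A, SS_B, SS_AxB, SS_W).
    """
    a, b, n = len(cells), len(cells[0]), len(cells[0][0])
    all_values = [x for row in cells for cell in row for x in cell]
    grand = sum(all_values) / len(all_values)
    cell_mean = [[sum(c) / n for c in row] for row in cells]
    a_mean = [sum(cell_mean[i]) / b for i in range(a)]
    b_mean = [sum(cell_mean[i][j] for i in range(a)) / a for j in range(b)]
    ss_a = b * n * sum((m - grand) ** 2 for m in a_mean)
    ss_b = a * n * sum((m - grand) ** 2 for m in b_mean)
    ss_ab = n * sum((cell_mean[i][j] - a_mean[i] - b_mean[j] + grand) ** 2
                    for i in range(a) for j in range(b))
    ss_w = sum((x - cell_mean[i][j]) ** 2
               for i in range(a) for j in range(b) for x in cells[i][j])
    return ss_a, ss_b, ss_ab, ss_w

# Factor A: gasoline (regular, high-octane). Factor B: automobile (2-wheel, 4-wheel drive).
mpg = [[[26.7, 25.2], [28.6, 29.3]],
       [[32.3, 32.8], [26.1, 24.2]]]

ss_gas, ss_auto, ss_inter, ss_within = two_way_anova_ss(mpg)
print(round(ss_gas, 3), round(ss_auto, 3), round(ss_inter, 3), round(ss_within, 3))
```

The four values should match the SS column of the summary table (3.920, 9.680, 54.080 and 3.300).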

EXAMPLE 14−6

A medical researcher wishes to test the effects of two different diets and two different exercise programs
on the glucose level in a person’s blood. The glucose level is measured in milligrams per deciliter (mg/dl).
Three subjects are randomly assigned to each group. Analyse the data shown below, using two-way
ANOVA with $\alpha = 0.05$.

The data (in milligrams per deciliter) and the summary table are shown here.

                    Diet
Exercise Program    A             B
I                   62, 64, 66    58, 62, 53
II                  65, 68, 72    83, 85, 91


Source SS d.f MS F
Exercise 816.75
Diet 102.083
Interaction 444.083
Within (error) 108
Total 1470.916

SOLUTION
Step 1: State the hypothesis.

Hypothesis for Exercise:


H 0 : There is no difference in the means for the glucose levels of the persons in the two exercise
programs.
H1 : There is difference in the means for the glucose levels of the persons in the two exercise programs.

Hypothesis for diet:


H 0 : There is no difference in the means for the glucose levels of the persons in the two diet programs.
H1 : There is difference in the means for the glucose levels of the persons in the two diet programs.

Hypothesis for interaction Effect:


H 0 : There is no interaction effect between type of exercise program and type of diet on a person’s
glucose level.
H1 : There is interaction effect between type of exercise program and type of diet on a person’s glucose
level.

Step 2: Find the critical values for each F-test. Factor A is the type of exercise and it has two levels (I
and II), so a = 2. Factor B is the type of diet and it has two levels (A and B), so b = 2. The number of
data values in each group is 3, so n = 3. The degrees of freedom are as follows:

Exercise: a – 1 = 2 – 1 = 1.
Diet: b – 1 = 2 – 1 = 1.
Interaction: (a – 1)(b – 1) = (2 – 1)(2 – 1) = 1.
Error: ab(n – 1) = 2(2)(3 – 1) = 8.

Therefore, the critical values are as follows:

Exercise: Using $\alpha = 0.05$, d.f.N = 1, and d.f.D = 8, we get 5.32.
Diet: Using $\alpha = 0.05$, d.f.N = 1, and d.f.D = 8, we get 5.32.
Interaction: Using $\alpha = 0.05$, d.f.N = 1, and d.f.D = 8, we get 5.32.


Step 3: Complete the ANOVA table and compute the test values.

Therefore, the complete ANOVA table is:

Source SS d.f MS F
Exercise 816.75 1 816.75 60.5
Diet 102.083 1 102.083 7.56
Interaction 444.083 1 444.083 32.9
Within (error) 108 8 13.5
Total 1470.916 11

The test values are as follows:

Exercise: F  60.5,
Diet: F  7.56,
Interaction: F  32.9,

Step 4: Make the decision and state the conclusion.

Exercise:
Since the test value F = 60.5 exceeds the critical value 5.32, it falls in the rejection region. We reject
the null hypothesis and conclude that there is a difference in the means for the glucose levels of the
persons in the two exercise programs.

Diet:
Since the test value F = 7.56 exceeds the critical value 5.32, it falls in the rejection region. We reject
the null hypothesis and conclude that there is a difference in the means for the glucose levels of the
persons in the two diet programs.

Interaction:
Since the test value F = 32.9 exceeds the critical value 5.32, it falls in the rejection region. We reject
the null hypothesis and conclude that there is an interaction effect between type of exercise program and
type of diet on a person’s glucose level.
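
The arithmetic in Steps 3 and 4 can be checked with a short Python sketch. It is only an illustration: the variable names are hypothetical, the sums of squares and degrees of freedom are copied from the ANOVA table, and the critical value 5.32 is the table value quoted in Step 2 rather than something the code computes.

```python
# Sums of squares and degrees of freedom copied from the ANOVA table.
ss = {"Exercise": 816.75, "Diet": 102.083, "Interaction": 444.083}
df = {"Exercise": 1, "Diet": 1, "Interaction": 1}
ss_within, df_within = 108.0, 8

# Critical value from the F table (alpha = 0.05, d.f.N = 1, d.f.D = 8).
critical_value = 5.32

# Mean square within (error) is the denominator of every F ratio.
ms_within = ss_within / df_within

# F = MS(source) / MS(within) for each source of variation.
f_values = {name: (ss[name] / df[name]) / ms_within for name in ss}

for name, f in f_values.items():
    decision = "reject H0" if f > critical_value else "do not reject H0"
    print(f"{name}: F = {f:.2f} -> {decision}")
```

Each F ratio divides a mean square by the same within-group mean square, which is why a single error term (13.5) appears in all three tests.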

14.5 Summary
This chapter explains the concepts of analysis of variance (ANOVA). The concepts discussed in this
chapter are F-distribution, one-way and two-way analysis of variance.



EXERCISES

1. The amount of sodium (in milligrams) in one serving for a random sample of three different kinds of
foods is listed below. At the 0.05 level of significance, is there sufficient evidence to conclude that a
difference in mean sodium amounts exists among condiments, cereals, and desserts?

Condiments Cereals Desserts


270 260 100
130 220 180
230 290 250
180 290 250
80 200 300
70 320 360
200 140 300
160

2. How does the two-way ANOVA differ from the one-way ANOVA?

3. A contractor wishes to see whether there is a difference in the time (in days) it takes two
subcontractors to build three different types of homes. At α = 0.05, analyse the data shown in the
table below, using the two-way ANOVA summary table provided.

Subcontractor Home type


I II III
A 25,28,26,30,31 30,32,35,29,31 43,40,42,49,48
B 15,18,22,21,17 21,27,18,15,19 23,25,24,17,13

ANOVA Summary Table

Source SS d.f. MS F
Subcontractor 1672.553
Home type 444.867
Interaction 313.267
Within (error) 328.800
Total 2759.487



REFERENCES
1. Bluman, A. G. (2012). Elementary Statistics - A Step By Step Approach, 8th edition, McGraw-Hill.
2. Mann, P. S. (2010). Introductory Statistics, 7th edition, John Wiley & Sons, New York.
3. Selvanathan, A., Selvanathan, S., Keller, G. and Warrack, W. (2006). Australian Business
Statistics, 4th edition, Cengage Learning Pty, Australia.

APPENDIX A:

ANSWERS TO EXERCISES

Appendix A: Answers to Exercises


Chapter 1: Introduction to Statistics

1.
A. Descriptive
B. Inferential
C. Descriptive
D. Inferential

2.
A. Number of tutorial session a student missed
B. Statistic: mean number of missed classes for the 35 students is 2 days; parameter: the average
number of tutorial session a student missed in 2016 with the previous year’s average of 3 classes.

3.
A. Interval
B. Ordinal
C. Nominal
D. Ratio
E. Ratio

4.
A. Systematic
B. Stratified
C. Cluster
D. Random
E. Systematic

5. True

6. The confounding variable influences the dependent variable, but cannot be separated from the
independent variable.

7.
A. Discrete
B. Continuous
C. Discrete
D. Continuous

8.
A. Quantitative
B. Qualitative
C. Qualitative
D. Quantitative
E. Quantitative

9.
A. Observational
B. Observational
C. Experimental



Chapter 2: Frequency Distributions and Graphs

1. A. & B.

Category   Frequency (f)   Relative frequency   Percentage
A          5               0.20                 20
B          7               0.28                 28
O          9               0.36                 36
AB         4               0.16                 16
           Σf = 25

C. 56%

D. [Figure: Pie chart showing the distribution of blood type: A 20%, B 28%, O 36%, AB 16%]

E. [Figure: Bar graph showing the distribution of blood type; x-axis: Blood Type (A, B, AB, O), y-axis: Frequency]


2. A. & B.
Approx. class width = (44 − 12)/5 = 6.4, rounded up to 7.

Class    Frequency   Class boundaries   Relative frequency   Percentage
12−18    4           11.5−18.5          0.20                 20
19−25    6           18.5−25.5          0.30                 30
26−32    7           25.5−32.5          0.35                 35
33−39    2           32.5−39.5          0.10                 10
40−46    1           39.5−46.5          0.05                 5
         Σf = 20

C.

[Figure: Histogram; x-axis: Amount of protein (g), class boundaries 11.5 to 46.5; y-axis: Frequency]

[Figure: Frequency polygon; x-axis: Amount of protein (g), class midpoints; y-axis: Frequency]


D.

[Figure: Ogive; x-axis: Amount of protein (g), class boundaries 11.5 to 46.5; y-axis: Cumulative frequency]

3.
A.

Stem Leaf
4 58
5 245889
6 11245667
7 0357789
8 02366
9 15

B. 16/30 = 8/15
C. In the 60s
D. Approximately symmetric with the peak in the 60s.



Chapter 3: Data Description
1. A.

\[ \bar{X} = \frac{\sum X}{n} = \frac{13857}{12} = 1154.8 \]
\[ \text{Median} = \frac{947 + 956}{2} = 951.5 \]
Mode = 856

X        (X − X̄)²
2215     1124130.06
1888     537655.56
1477     103845.06
1059     9168.06
977      31595.06
956      39501.56
947      43160.06
924      53245.56
899      65408.06
856      89251.56
856      89251.56
803      123728.06
ΣX = 13857    Σ(X − X̄)² = 2309940.25

\[ s^2 = \frac{\sum (X - \bar{X})^2}{n - 1} = \frac{2309940.25}{11} = 209994.57 \;\Rightarrow\; s = \sqrt{209994.57} = 458.25 \]
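
As a cross-check on the hand calculation, the same measures can be obtained with Python's standard `statistics` module. This is a sketch rather than part of the original solution; the list `data` simply holds the twelve values from the X column above.

```python
import statistics

# The twelve data values from the X column above.
data = [2215, 1888, 1477, 1059, 977, 956, 947, 924, 899, 856, 856, 803]

mean = statistics.mean(data)          # sum of X divided by n
median = statistics.median(data)      # average of the two middle values
mode = statistics.mode(data)          # most frequent value
variance = statistics.variance(data)  # sample variance, divisor n - 1
std_dev = statistics.stdev(data)      # square root of the sample variance

print(mean, median, mode, round(variance, 2), round(std_dev, 2))
```

Note that `statistics.variance` and `statistics.stdev` divide by n − 1, matching the sample formulas used here; `pvariance` and `pstdev` would divide by n instead.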
B.
Q1 (or P25) is obtained by:
position of P25 = 25(12)/100 = 3, i.e. the 3rd term

Because the position is a whole number, the value is approximated by the average of the 3rd and 4th
terms in the ranked data. Therefore,
Q1 = (856 + 899)/2 = 877.5

Q3 (or P75) is obtained by:
position of P75 = 75(12)/100 = 9, i.e. the 9th term

Because the position is a whole number, the value is approximated by the average of the 9th and 10th
terms in the ranked data. Therefore,
Q3 = (1059 + 1477)/2 = 1268.

Q2 = P50 = Median = 951.5

P40 is obtained by:
position of P40 = 40(12)/100 = 4.8, i.e. the 4.8th term

Because 4.8 is not a whole number, round up and use the 5th term in the ranked data.

Therefore,
P40 = 924

Percentile rank of 956 = [(6 + 0.5)/12] × 100% = 54.2%.
C.

Step 1: The interquartile range is IQR = Q3 − Q1 = 1268 − 877.5 = 390.5.

Step 2: The interval is:

877.5 − 1.5 × 390.5 < x < 1268 + 1.5 × 390.5  ⇒  291.75 < x < 1853.75

Step 3: Check the data set for any data values that fall outside the interval from 291.75 to
1853.75. Since the data values 1888 and 2215 are outside this interval, they can be considered
outliers.
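
The outlier check can be sketched in Python as follows. The helper `average_of_terms` is a hypothetical name; it reproduces this solution's quartile rule (average of the k-th and (k+1)-th ranked terms when the position k is a whole number), which differs from the default rule in some software, so library quartile functions may return slightly different values.

```python
# Ranked data (ascending order).
data = sorted([2215, 1888, 1477, 1059, 977, 956, 947, 924, 899, 856, 856, 803])

def average_of_terms(ranked, k):
    """Average of the k-th and (k+1)-th terms, counting from 1."""
    return (ranked[k - 1] + ranked[k]) / 2

q1 = average_of_terms(data, 3)   # position 25(12)/100 = 3
q3 = average_of_terms(data, 9)   # position 75(12)/100 = 9
iqr = q3 - q1

# Fences: values outside (lower, upper) are flagged as outliers.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower or x > upper]

print(q1, q3, iqr, lower, upper, outliers)
```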

D.
The Five-Number Summary (Note: The data should be arranged in ascending order first)
1. The lowest value is 803;
2. Q1  877.5 ;
3. The median is 951.5;
4. Q3  1268 ;
5. The highest value is 2215;

[Figure: Boxplot showing the cash compensation received in 2009; axis: cash compensation (in $000s), 500 to 2500]

Since the median is to the left of the center of the box, and the right whisker is longer than the left
whisker, the distribution is positively skewed.

2.

Number of employees   Frequency (f)   Midpoint (Xm)   f·Xm   f·Xm²
1 – 10                32              5.5             176    968
11 – 20               34              15.5            527    8168.5
21 – 30               14              25.5            357    9103.5
31 – 40               12              35.5            426    15123
41 – 50               18              45.5            819    37264.5
                                      Σf·Xm = 2305    Σf·Xm² = 70627.5

A.

\[ \mu = \frac{\sum f X_m}{N} = \frac{2305}{110} = 20.95 \]

Modal group: 11-20

To find the median we need to use the percentile graph:

[Figure: Percentile graph, Number of People Employed; x-axis: Number of people (class boundaries 0.5 to 50.5), y-axis: Cumulative percentage]

From the graph, the median is approximately 18.

B.

\[ \sigma^2 = \frac{\sum f X_m^2 - \dfrac{\left(\sum f X_m\right)^2}{N}}{N} = \frac{70627.5 - \dfrac{(2305)^2}{110}}{110} = 202.98 \;\Rightarrow\; \sigma = \sqrt{202.98} = 14.25 \]
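
The grouped-data mean and standard deviation can be reproduced with the sketch below. The tuple layout `(lower limit, upper limit, frequency)` is an assumption made for illustration; the numbers are the class limits and frequencies from the table in this answer.

```python
import math

# (lower class limit, upper class limit, frequency) for each class.
classes = [(1, 10, 32), (11, 20, 34), (21, 30, 14), (31, 40, 12), (41, 50, 18)]

n = sum(f for _, _, f in classes)                                # N = sum of f
sum_fx = sum(f * (lo + hi) / 2 for lo, hi, f in classes)         # sum of f * Xm
sum_fx2 = sum(f * ((lo + hi) / 2) ** 2 for lo, hi, f in classes) # sum of f * Xm^2

mean = sum_fx / n
variance = (sum_fx2 - sum_fx ** 2 / n) / n   # population form, divisor N
std_dev = math.sqrt(variance)

print(n, sum_fx, sum_fx2, round(mean, 2), round(variance, 2), round(std_dev, 2))
```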

3. The weights for the results are in the following ratio:

Exam 1 : Exam 2 : Final Exam = 1 : 1 : 2

Hence,
\[ \text{Weighted mean} = \frac{1(73) + 1(67) + 2(85)}{1 + 1 + 2} = 77.5 \]

4.
A. Firm B has the larger wage bill: 200 × $185 = $37,000, compared with 100 × $196 = $19,600 for
Firm A.

B. Firm B has greater variability.



Chapter 4: Probability (Part I)

1.

2.
A. Classical, Empirical and Subjective.
B. Empirical, an experiment is performed.

3.
A. Simple event, since a simple event is an event with only one sample point.
B. A compound, since a compound event is an event with more than one sample point.

4.
A. C and S are not mutually exclusive events since P(C ∩ S) ≠ 0.

B.
P(C ∪ S) = 0.85, P(S) = 0.61, P(C) = 0.31,
P(C ∩ S) = P(C) + P(S) − P(C ∪ S) = 0.31 + 0.61 − 0.85 = 0.07

C.

P(C' ∩ S') = 1 − P(C ∪ S) = 1 − 0.85 = 0.15

5. Let the events be D = card is Diamond, Q = card is Queen, A = card is 3 and B = card is 6.

A. P(Q) = 4/52

B. P(A ∩ D) = 1/52

C. P(A ∪ D) = P(A) + P(D) − P(A ∩ D) = 4/52 + 13/52 − 1/52 = 16/52 = 4/13

D. Since P(A ∩ B) = 0, P(A ∪ B) = P(A) + P(B) = 4/52 + 4/52 = 8/52 = 2/13



6. Let the events be N = Nurse, D = Doctor and F = Female.

A. P(F) = 10/13

B. P(N ∩ F) = 7/13

C. P(N ∪ F) = P(N) + P(F) − P(N ∩ F) = 8/13 + 10/13 − 7/13 = 11/13

7.

A. 13/50

B. Since each die can land in six different ways, and two dice are rolled, the sample space can be
presented by a rectangular array as follows:

                               Die 2
Die 1     1       2       3       4       5       6
1       (1,1)   (1,2)   (1,3)   (1,4)   (1,5)   (1,6)
2       (2,1)   (2,2)   (2,3)   (2,4)   (2,5)   (2,6)
3       (3,1)   (3,2)   (3,3)   (3,4)   (3,5)   (3,6)
4       (4,1)   (4,2)   (4,3)   (4,4)   (4,5)   (4,6)
5       (5,1)   (5,2)   (5,3)   (5,4)   (5,5)   (5,6)
6       (6,1)   (6,2)   (6,3)   (6,4)   (6,5)   (6,6)

Hence, the classical probability of rolling a sum of 7 is 6/36 = 1/6.
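
The sample space above can also be enumerated in a few lines of Python, as a sketch of the classical probability rule (favourable outcomes divided by total outcomes). The variable names are illustrative only.

```python
from fractions import Fraction
from itertools import product

# All ordered pairs (die 1, die 2): the 6 x 6 sample space.
sample_space = list(product(range(1, 7), repeat=2))

# Outcomes whose faces add up to 7.
favourable = [outcome for outcome in sample_space if sum(outcome) == 7]

# Classical probability = favourable outcomes / total outcomes.
p_sum_7 = Fraction(len(favourable), len(sample_space))

print(len(sample_space), favourable, p_sum_7)
```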

C. The empirical probability differs noticeably from the theoretical (classical) probability because the
number of trials used to determine the empirical probability is small. As the number of trials
increases, the empirical probability tends to get closer to the theoretical probability.



Chapter 5: Probability (Part II)

1. Let P = guinea pig is pregnant and P' = guinea pig is not pregnant. Picking the first pig affects the
second and third picks, hence the events are dependent.
A. P(P ∩ P ∩ P) = (5/8)(4/7)(3/6) = 5/28 = 0.179
B. P(P ∩ P ∩ P') = 3 × (5/8)(4/7)(3/6) = 15/28 = 0.536
C. P(at least one pig is pregnant) = 1 − P(none are pregnant) = 1 − P(P' ∩ P' ∩ P')
   = 1 − (3/8)(2/7)(1/6) = 55/56 = 0.982
2. Let C = student owns a car and C' = student does not own a car.
A. P(C ∩ C ∩ C) = 0.1 × 0.1 × 0.1 = 0.001

B. P(C ∩ C ∩ C') = 3 × (0.1 × 0.1 × 0.9) = 0.027

C. P(at least one student owns a car) = 1 − P(none own cars) = 1 − P(C' ∩ C' ∩ C')
   = 1 − (0.9 × 0.9 × 0.9) = 0.271

3. Let W = student works; W' = student does not work; M = student is male and F = student is female.
A.
i. P(W) = 250/400 = 0.625
ii. P(W ∪ M) = P(W) + P(M) − P(W ∩ M) = 250/400 + 180/400 − 120/400 = 310/400 = 0.775
iii. P(F ∩ W') = 90/400 = 0.225
iv. P(W' | M) = P(W' ∩ M)/P(M) = 60/180 = 0.333
B. Not mutually exclusive events since P(M ∩ W') = 60/400 ≠ 0.
C. Dependent events since P(F ∩ W') ≠ P(F) × P(W').

4. Let B = black marble; R = red marble; A= urn 1 and C = urn 2;


 1 5   1 1  23
P( B)  P( A  B)  P(C  B)           0.958
 2 8   2 4  24



5. Let the events be R = card is Red, H = card is Heart and D = card is Diamond.
A. P(R ∩ H) = P(H ∩ H) + P(D ∩ H) = (13/52 × 12/51) + (13/52 × 13/51) = 25/204 = 0.123
B. P(H ∩ R) = 13/52 × 25/51 = 25/204 = 0.123
6. 2 × 4 = 8 different routes possible.
7.
A. Since there are 6 distinct letters in the word SUNDAY, the number of distinct ways the letters
can be arranged is 6! = 720.

B. 5! = 120.
C. 4! = 24.

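8.
A. \( \binom{9}{4} \times \binom{9}{3} \times \binom{9}{2} = 381024 \).
B. \( \binom{9}{4} \times \binom{5}{3} \times \binom{2}{2} = 1260 \).
9.
A. \( P(\text{All dentists}) = \dfrac{{}_{9}C_{5} \times {}_{6}C_{0}}{{}_{15}C_{5}} = \dfrac{126}{3003} = 0.042 \)

B. \( P(\text{2 dentists} \cap \text{3 doctors}) = \dfrac{{}_{9}C_{2} \times {}_{6}C_{3}}{{}_{15}C_{5}} = \dfrac{720}{3003} = 0.24 \)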
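
The counting results for questions 8 and 9 can be verified with `math.comb` from the Python standard library (available from Python 3.8). The variable names below are hypothetical labels for the four answers.

```python
from math import comb

# Question 8: products of combinations.
q8_a = comb(9, 4) * comb(9, 3) * comb(9, 2)
q8_b = comb(9, 4) * comb(5, 3) * comb(2, 2)

# Question 9: ways to pick the stated mix, divided by all ways to pick 5 of 15.
p_all_dentists = comb(9, 5) * comb(6, 0) / comb(15, 5)
p_two_dentists_three_doctors = comb(9, 2) * comb(6, 3) / comb(15, 5)

print(q8_a, q8_b)
print(round(p_all_dentists, 3), round(p_two_dentists_three_doctors, 2))
```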

10. Let E: number contains 7 and E': number does not contain 7.

\[ n(E') = 9^7; \quad P(E') = \frac{9^7}{10^7}. \quad \text{Hence, } P(E) = 1 - P(E') = 1 - \frac{9^7}{10^7} = 0.522 \]



Chapter 6: Discrete Probability Distributions

1. c + 4c + 9c + 16c = 1 ⇒ 30c = 1 ⇒ c = 1/30

2.
A. P(X = 2) = 0.35
B. P(X ≥ 2) = P(X = 2) + P(X = 3) = 0.35 + 0.3 = 0.65
C. P(X ≤ 1) = P(X = 1) + P(X = 0) = 0.2 + 0.15 = 0.35

3. μ = ΣX·P(X) = 0(0.15) + 1(0.2) + 2(0.35) + 3(0.3) = 1.8

σ² = ΣX²·P(X) − μ² = [0²(0.15) + 1²(0.2) + 2²(0.35) + 3²(0.3)] − 1.8² = 4.3 − 1.8² = 1.06

σ = √1.06 = 1.03
4.
A.

X      0       1       2       3
P(X)   0.008   0.096   0.384   0.512

B. μ = ΣX·P(X) = 0 + 0.096 + 2(0.384) + 3(0.512) = 2.4

σ² = ΣX²·P(X) − μ² = [0 + 0.096 + 2²(0.384) + 3²(0.512)] − 2.4² = 0.48 ⇒ σ = √0.48 = 0.693

5.
Joe’s gain X    $4      −$1
P(X)            6/36    30/36

E(X) = 4(6/36) − 1(30/36) = −1/6 ≈ −$0.17
∴ the expected loss in playing 15 games is 15 × $(1/6) = $2.50



6.
X      0      1       2       3
P(X)   1/56   15/56   30/56   10/56

7. Let X be the number of customers having purchased shoes. In this case n = 20, p = 0.3 and q = 0.7,
so P(X = 0) = 20C0 (0.3)^0 (0.7)^20 = 0.0008 and P(X = 1) = 20C1 (0.3)^1 (0.7)^19 = 0.0068.
Therefore,
P(X ≥ 2) = 1 − [P(X = 0) + P(X = 1)] = 1 − (0.0008 + 0.0068) = 0.9924
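
The binomial probabilities used here follow P(X = x) = nCx · p^x · q^(n−x). A minimal Python sketch of that formula is shown below; `binomial_pmf` is a hypothetical helper written for this illustration, not a library function.

```python
from math import comb

def binomial_pmf(x, n, p):
    """P(X = x) for a binomial distribution with n trials and success probability p."""
    return comb(n, x) * p ** x * (1 - p) ** (n - x)

n, p = 20, 0.3
p0 = binomial_pmf(0, n, p)       # no customer purchases shoes
p1 = binomial_pmf(1, n, p)       # exactly one customer purchases shoes
p_at_least_2 = 1 - (p0 + p1)     # complement rule

print(round(p0, 4), round(p1, 4), round(p_at_least_2, 4))
```

Using the complement avoids summing the nineteen terms P(X = 2) through P(X = 20).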

8. Let X be the number of articles submitted for publication. In this case n = 8, p = 0.11 and q = 0.89.
A. P(X = 4) = 8C4 (0.11)^4 (0.89)^4 = 0.00643

B. P(X ≥ 1) = 1 − P(X = 0) = 1 − 8C0 (0.11)^0 (0.89)^8 = 0.606

9. Here n = 400, p = 0.03 and q = 0.97. Using the formulas, we have

μ = n·p = 400 × 0.03 = 12
σ² = n·p·q = 400 × 0.03 × 0.97 = 11.64 ⇒ σ = √11.64 = 3.41

10. Here n = 1000, p = 0.17 and q = 0.83.

A. E(X) = n·p = 1000 × 0.17 = 170

B. σ² = n·p·q = 1000 × 0.17 × 0.83 = 141.10 ⇒ σ = √141.10 = 11.88



Chapter 7: The Normal Distribution

1.
A.
P(−0.21 < Z < 1.57) = 0.0832 + 0.4418 = 0.525.

B.
P(Z < 1.43) = 0.5 + 0.4236 = 0.9236.

2.
A. ZO  1.16.

B. ZO  2.101.

3. n = 385, μ = 65, σ = 10;

A. P(47 < X < 67) = P((47 − 65)/10 < z < (67 − 65)/10)
   = P(−1.8 < z < 0.2) = 0.4641 + 0.0793 = 0.543 ≈ 54%

B. P(X > 86) = P(z > (86 − 65)/10) = P(z > 2.1) = 0.5 − 0.4821 = 0.0179 ≈ 1.8%
4.
A. P(X < 6200) = P(z < −0.25) ≈ 0.4
B. P(X̄ < 6200) = P(z < (6200 − 6300)/(400/√40)) = P(z < −1.58) = 0.0571

5. μ = 51, σ = 14;

A. P(58 < X < 65) = P(0.5 < z < 1)
   = 0.3413 − 0.1915
   = 0.1498
0.1498 × 200 ≈ 30 students.

Approximately 30 students will score between 58 and 65.

B. The area in the right tail is 0.05, so z = 1.64 (or 1.65). Then

(x − 51)/14 = 1.64 ⇒ x ≈ 74

Therefore 74 is the minimum mark to obtain an A+.

6. It is given that P( X  4)  0.3 and P( X  4.53)  0.2.


Using P( X  4)  0.3, we get the z  0.52 from the tables and hence we get
4
0.52  . (1)



Similarly using P( X  4.53)  0.2 so we get another equation
4.53  
0.84  . (2)

We have to find Solving the equations (1) and (2) simultaneously we get   312 kg and   168
kg.

7. We need P(X̄ > 26.3). The z value of X̄ = 26.3 is

\[ z = \frac{26.3 - 25}{3/\sqrt{20}} = 1.94. \]

Hence, P(X̄ > 26.3) = P(z > 1.94) = 0.5 − 0.4738 = 0.0262.
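
The table lookup can be cross-checked with `statistics.NormalDist` from the Python standard library. The sketch below assumes the values read from this answer (mean 25, standard deviation 3, sample size 20); because it uses the unrounded z value, the tail area agrees with the table result only to about three decimal places.

```python
import math
from statistics import NormalDist

mu, sigma, n = 25, 3, 20      # values taken from the worked answer
x_bar = 26.3

# Standard error of the mean and the corresponding z value.
standard_error = sigma / math.sqrt(n)
z = (x_bar - mu) / standard_error

# Right-tail area under the standard normal curve.
p_right = 1 - NormalDist().cdf(z)

print(round(z, 2), round(p_right, 4))
```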



Chapter 8: Confidence Intervals and Sample Size

1. Confidence level is the probability that the interval estimate will contain the parameter and confidence
interval is a specific interval estimate of a parameter determined from the data obtained from a
sample and using specific confidence level.

2. Given n = 8, X̄ = 13.1 and s = 4.1. Since σ is unknown and n < 30, we use t_{α/2} in the formula.

Using d.f. = 7 and α = 0.05, we get t_{α/2} = 2.365. Hence the 95% confidence interval of μ is

\[ 13.1 - 2.365\left(\frac{4.1}{\sqrt{8}}\right) < \mu < 13.1 + 2.365\left(\frac{4.1}{\sqrt{8}}\right) \]
9.7 < μ < 16.5.

3. Given   900 and E  5. For 99% confidence level, we have z /2  2.58. Hence the minimum
sample size is

 2.58  900 
2

n   239.6  240.


 5 
4. We determine that X̄ = 33.5, s = 27.678, and n = 10;
A. The point estimate of μ is X̄ = 33.5.
B. Since σ is unknown and n < 30, we use t_{α/2} in the formula. Using d.f. = 9 and α = 0.02,
we get t_{α/2} = 2.821. Hence the 98% confidence interval of μ is
\[ 33.5 - 2.821\left(\frac{27.678}{\sqrt{10}}\right) < \mu < 33.5 + 2.821\left(\frac{27.678}{\sqrt{10}}\right) \]
33.5 − 24.691 < μ < 33.5 + 24.691
8.8 < μ < 58.2

5. Given that X̄ = 26.1, σ = 4.2 and n = 30. For 99% confidence level, we have z_{α/2} = 2.58. Hence,
\[ 26.1 - 2.58\left(\frac{4.2}{\sqrt{30}}\right) < \mu < 26.1 + 2.58\left(\frac{4.2}{\sqrt{30}}\right) \]
26.1 − 1.98 < μ < 26.1 + 1.98
24.12 < μ < 28.08
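
The interval for question 5 can be reproduced with the sketch below, which assumes the values stated in the answer (sample mean 26.1, σ = 4.2, n = 30) and the table value z = 2.58 for 99% confidence. With a margin of error of 1.98, the limits evaluate to 26.1 − 1.98 = 24.12 and 26.1 + 1.98 = 28.08.

```python
import math

x_bar, sigma, n = 26.1, 4.2, 30
z = 2.58                      # table value for a 99% confidence level

# Margin of error E = z * sigma / sqrt(n).
margin = z * sigma / math.sqrt(n)
lower, upper = x_bar - margin, x_bar + margin

print(round(margin, 2), round(lower, 2), round(upper, 2))
```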
6. Given that p̂ = 0.29, q̂ = 0.71 and E = 0.05. For 90% confidence level, we have z_{α/2} = 1.65.
Hence,
\[ n = \hat{p}\hat{q}\left(\frac{z_{\alpha/2}}{E}\right)^2 = (0.29)(0.71)\left(\frac{1.65}{0.05}\right)^2 = 224.23 \approx 225. \]



7. Given that p̂ = 157/500 = 0.314 and q̂ = 0.686.
A. For 99% confidence level, we have z_{α/2} = 2.58. Hence,

\[ p = \hat{p} \pm z_{\alpha/2}\sqrt{\frac{\hat{p}\hat{q}}{n}} = 0.314 \pm 2.58\sqrt{\frac{(0.314)(0.686)}{500}} = 0.314 \pm 0.054 \;\Rightarrow\; 0.260 < p < 0.368 \]
B. Given that E = 0.02. We have
\[ n = \hat{p}\hat{q}\left(\frac{z_{\alpha/2}}{E}\right)^2 = (0.314)(0.686)\left(\frac{2.58}{0.02}\right)^2 = 3584.5 \approx 3585 \]
8. Given that n = 995, p̂ = 0.291 and q̂ = 0.709. For 90% confidence level, we have z_{α/2} = 1.65.

The 90% confidence interval for p is:

\[ 0.291 - 1.65\sqrt{\frac{0.291(0.709)}{995}} < p < 0.291 + 1.65\sqrt{\frac{0.291(0.709)}{995}} \]
⇒ 0.291 − 0.0238 < p < 0.291 + 0.0238
⇒ 0.2672 < p < 0.3148



Chapter 9: Hypothesis Testing (Part I)

1. The null hypothesis is a statistical hypothesis that states there is no difference between a parameter
and a specific value or there is no difference between two parameters. The alternative hypothesis
specifies a specific difference between a parameter and a specific value, or that there is a difference
between two parameters. For example, H0: μ = 5 and H1: μ ≠ 5.

2.
A. H0: μ = 9.5 hrs
   H1: μ ≠ 9.5 hrs (two-tailed test)

B. H0: μ ≥ $105
   H1: μ < $105 (left-tailed test)

C. H0: μ ≤ $39000
   H1: μ > $39000 (right-tailed test)

D. H0: μ ≥ 10 mins
   H1: μ < 10 mins (left-tailed test)

3. Step 1: State the hypothesis.

H0: μ = 29
H1: μ ≠ 29 (claim)

Step 2: Find the critical value.

Since σ is given, use the z-test. Since α = 0.05 and the test is two-tailed, find α/2 = 0.05/2 = 0.025.
So the areas in the left tail and the right tail are each 0.025, and the critical values are z = ±1.96.

[Figure: standard normal curve; critical regions z < −1.96 and z > 1.96, acceptance region between them]

Step 3: Compute the test statistic value. We find that X̄ = 29.45, μ = 29, σ = 2.61 and n = 30.
Therefore,
\[ z = \frac{29.45 - 29}{2.61/\sqrt{30}} = 0.944 \]



Step 4: Make a decision

Since the test value z = 0.944 falls in the acceptance region, the decision is: “Do not reject H0”.
Step 5: Summarize the results.
It cannot be concluded that the average height differs from 29 inches.
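
The traditional z-test in this answer reduces to one computation and one comparison, sketched below in Python. The numbers are those stated in Step 3, and the critical value 1.96 is the table value for a two-tailed test at α = 0.05; the variable names are illustrative.

```python
import math

x_bar, mu0, sigma, n = 29.45, 29, 2.61, 30
critical_value = 1.96         # two-tailed test, alpha = 0.05

# Test value: z = (X-bar - mu) / (sigma / sqrt(n)).
z = (x_bar - mu0) / (sigma / math.sqrt(n))

# Reject H0 only if z lies in either tail beyond the critical values.
reject_h0 = abs(z) > critical_value

print(round(z, 3), "reject H0" if reject_h0 else "do not reject H0")
```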

4.
A.
Step 1: State the hypothesis.
H0: μ = 200 (claim)
H1: μ ≠ 200

Step 2: Find the critical value.

Although σ is unknown, n ≥ 30, so we use the z-test. Since α = 0.05 and the test is two-tailed, find
α/2 = 0.05/2 = 0.025. So the areas in the left tail and the right tail are each 0.025, and the critical
values are z = ±1.96.

[Figure: standard normal curve; critical regions z < −1.96 and z > 1.96, acceptance region between them]

Step 3: Compute the test statistic value. Given that X̄ = 198.2, μ = 200, s = 3.3 and n = 40.
Therefore,
\[ z = \frac{198.2 - 200}{3.3/\sqrt{40}} = -3.45 \]
Step 4: Make a decision

Since the test value z = −3.45 falls in the rejection region, the decision is: “Reject H0”.
Step 5: Summarize the results.
There is enough evidence to reject the claim that adult dogs fed a special diet will have a weight of
200 lbs.

B. Use Confidence Interval Method.

Step 1: State the hypothesis

H0: μ = 200 (claim)
H1: μ ≠ 200

Step 2: Find the confidence interval.

Although σ is unknown, n ≥ 30, so we use z_{α/2} in the formula. For 95% confidence level,
we have z_{α/2} = 1.96. Given that X̄ = 198.2, s = 3.3 and n = 40, the confidence interval of μ is
\[ 198.2 - 1.96\left(\frac{3.3}{\sqrt{40}}\right) < \mu < 198.2 + 1.96\left(\frac{3.3}{\sqrt{40}}\right) \;\Rightarrow\; 197.18 < \mu < 199.22 \]

Step 3: Make a decision to reject or not reject the null hypothesis.

Since the confidence interval does not contain the hypothesized value μ = 200, the decision is:
“Reject H0”.
Step 4: Summarize the results.
There is enough evidence to reject the claim that adult dogs fed a special diet will have a weight
of 200 lbs.

5. Step 1: State the hypothesis.

H0: μ = 980 (claim)
H1: μ ≠ 980

Step 2: Find the test value. We find that X  985,   980,   15 and n  150.

X   985  980
Therefore, z    2.357
 n 15 150

Step 3: Compute the P-value.

−2.357 2.357

P(0  z  2.357)  .493  P-value  2(0.5  .493)  2  0.007  0.014.

Step 4: Make a decision

Since the P-value is less than α = 0.05, the decision is “Reject H0”.

Step 5: Summarize the results.
There is enough evidence to reject the claim of the Pacific Tapioca manufacturer that the packets
of tapioca chips they make have a mean weight of 980 g.



6. Step 1: State the hypothesis.
H0: μ ≤ 3000
H1: μ > 3000 (claim)

Step 2: Find the critical value.

Since σ is given, use the z-test. Since α = 0.05 and the test is right-tailed, the critical
value is z = 1.65.

[Figure: standard normal curve; critical region z > 1.65, acceptance region to its left]

Step 3: Compute the test statistic value. Given that X̄ = 3120, μ = 3000, σ = 578 and n = 60.
Therefore,
\[ z = \frac{3120 - 3000}{578/\sqrt{60}} = 1.61. \]

Step 4: Make a decision

Since the test value z = 1.61 falls in the acceptance region, the decision is: “Do not reject H0”.

Step 5: Summarize the results.

There is not enough evidence to support the claim that the average production has increased.



Chapter 10: Hypothesis Testing (Part II)

1. Step 1: State the hypothesis.

H0: p ≤ 0.25
H1: p > 0.25 (claim).

Step 2: Find the test value. We find that p̂ = 63/200 = 0.315, p = 0.25, q = 0.75 and n = 200.
Therefore,
\[ z = \frac{0.315 - 0.25}{\sqrt{(0.25)(0.75)/200}} = 2.12. \]

Step 3: Compute the P-value.

The area to the right of z = 2.12 is 0.0170. Since it is a right-tailed test, the P-value is 0.0170.
Step 4: Make a decision

Since the P-value is less than α = 0.05, the decision is “Reject H0”.

Step 5: Summarize the results.
There is enough evidence to support the attorney’s claim.

2. Step 1: State the hypothesis.

H0: p = 0.68
H1: p ≠ 0.68 (claim)

Step 2: Find the critical value.

Since α = 0.01 and the test is two-tailed, find α/2 = 0.01/2 = 0.005. So the areas in the left tail
and the right tail are each 0.005, and the critical values are z = ±2.58.

[Figure: standard normal curve; critical regions z < −2.58 and z > 2.58, acceptance region between them]

Step 3: Compute the test statistic value. We find that
p̂ = 92/150 = 0.613, p = 0.68, q = 0.32 and n = 150.
Therefore,
\[ z = \frac{0.613 - 0.68}{\sqrt{(0.68)(0.32)/150}} = -1.76 \]
Step 4: Make a decision

Since the test value z = −1.76 falls in the acceptance region, the decision is: “Do not reject H0”.
Step 5: Summarize the results.
Therefore, the data do not suggest a difference from the national proportion.

3.
A. Use Confidence Interval Method.

Step 1: State the hypothesis

H0: μ = 3.18 (claim)
H1: μ ≠ 3.18

Step 2: Find the confidence interval.

Since σ is unknown and n < 30, we use t_{α/2} in the formula. Since α = 0.05 and the
test is two-tailed, the areas in the left tail and right tail are each 0.05/2 = 0.025. Using the
t-distribution table with d.f. = 23 and α = 0.05 (two tails), we find that t_{α/2} = 2.069.
We find that X̄ = 3.833, s = 1.435 and n = 24. The confidence interval of μ is:

\[ 3.833 - 2.069\left(\frac{1.434563}{\sqrt{24}}\right) < \mu < 3.833 + 2.069\left(\frac{1.434563}{\sqrt{24}}\right) \;\Rightarrow\; 3.23 < \mu < 4.44 \]

Step 3: Make a decision to reject or not reject the null hypothesis.

Since the confidence interval does not contain the hypothesized value μ = 3.18, the decision is:
“Reject H0”.

Step 4: Summarize the results.

We conclude that the average family size differs from the national average.

B. Use a traditional method.

Step 1: State the hypothesis.

H0: μ = 3.18 (claim)
H1: μ ≠ 3.18

Step 2: Find the critical value.

Since σ is unknown and n < 30, we use the t-test. Since α = 0.05 and the test is two-tailed,
the areas in the left tail and right tail are each 0.05/2 = 0.025. Using the t-distribution table
with d.f. = 23 and α = 0.05 (two tails), the critical values are t = ±2.069.

Step 3: Compute the test statistic value. We find that

X̄ = 3.833, μ = 3.18, s = 1.435 and n = 24. Therefore,

\[ t = \frac{\bar{X} - \mu}{s/\sqrt{n}} = \frac{3.833 - 3.18}{1.434563/\sqrt{24}} = 2.23 \]

Step 4: Make a decision

Since the test value t = 2.23 falls in the rejection region, the decision is: “Reject H0”.
Step 5: Summarize the results.
We conclude that the average family size differs from the national average.

4. Use Confidence Interval Method.

Step 1: State the hypothesis


H 0 :   700 (claim)
H1 :   700

Step 2: Find the confidence interval. We know that that X  685, s  125, n  400. And for 98%
confidence level, we have z /2  2.33 . Thus, the 98% confidence interval for  is
 125   125 
685  2.33      685  2.33    638.4    731.6
 400   400 

Step 3: Make a decision to reject or do not reject null hypothesis.


Since the confidence interval contains the hypothesized value  = 700, the decision is: “Do not reject
H 0 ”.
Step 4: Summarize the results.
There is enough evidence to support the claim that a factory worker in Vanuatu earns an average of
$700 per week.



5. Step 1: State the hypothesis.

H0: p ≥ 0.35 (claim)
H1: p < 0.35
Step 2: Find the critical value.

Since α = 0.025 and the test is left-tailed, the area in the left tail is 0.025. The critical value is
z = −1.96.

[Figure: standard normal curve; critical region z < −1.96, acceptance region to its right]

Step 3: Compute the test statistic value. We find that

p̂ = 112/400 = 0.28, p = 0.35, q = 0.65 and n = 400.

Therefore,
\[ z = \frac{0.28 - 0.35}{\sqrt{(0.35)(0.65)/400}} = -2.935 \]

Step 4: Make a decision

Since the test value z = −2.935 falls in the rejection region, the decision is: “Reject H0”.
Step 5: Summarize the results.
It can be concluded that the company should not market this yogurt.



Chapter 11: Testing the Equality of Two Population Means

1. Step 1: State the hypothesis.

H0: μ1 = μ2
H1: μ1 ≠ μ2 (claim)

Step 2: Find the critical value.

Since σ1 and σ2 are given, we use the z-test. Since α = 0.01 and the test is two-tailed, find
α/2 = 0.01/2 = 0.005. So the areas in the left tail and the right tail are each 0.005, and the critical
values are z = ±2.58.

[Figure: standard normal curve; critical regions z < −2.58 and z > 2.58, acceptance region between them]

Step 3: Compute the test statistic value.

We know that X̄1 = 59235, n1 = 40, σ1 = 8945 and X̄2 = 52487, n2 = 35, σ2 = 10125. Hence,

\[ z = \frac{(59235 - 52487) - 0}{\sqrt{\dfrac{8945^2}{40} + \dfrac{10125^2}{35}}} = 3.04 \]

Step 4: Make a decision

Since the test value z = 3.04 falls in the rejection region, the decision is: “Reject H0”.
Step 5: Summarize the results.
Therefore, it can be concluded that there is a difference in mean earnings between male and female
college graduates.
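
The two-sample z statistic can be checked with the sketch below. It uses the sample means, sample sizes and population standard deviations stated in Step 3, with the hypothesized difference set to 0 under H0; the critical value 2.58 is the table value for a two-tailed test at α = 0.01.

```python
import math

x1, n1, sigma1 = 59235, 40, 8945
x2, n2, sigma2 = 52487, 35, 10125
critical_value = 2.58         # two-tailed test, alpha = 0.01

# Standard error of the difference between two independent sample means.
standard_error = math.sqrt(sigma1 ** 2 / n1 + sigma2 ** 2 / n2)

# Test value; the hypothesized difference under H0 is 0.
z = ((x1 - x2) - 0) / standard_error
reject_h0 = abs(z) > critical_value

print(round(z, 2), "reject H0" if reject_h0 else "do not reject H0")
```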

2. Use a traditional method.

Step 1: State the hypothesis.


H 0 : 1  2
H1 : 1  2 (claim)



Step 2: Find the critical value.

We know that 1 and  2 are unkown and unequal, we use t-test. Since   005 and the test is
two–tailed, so the area on the left tail and right tail are 0.05/2 = 0.025. Using the t-distribution table
with d . f  24 and   005 (or 2 p  0.05) , the critical values are t  2.797 . See the
diagram below.

Critical Critical
region region
Acceptance
region

−2.797 2.7967

Step 3: Compute the test statistic value. We know that X̄1 = 223, n1 = 30, s1 = 6.1 and
X̄2 = 229, n2 = 25, s2 = 5.8. Therefore,
\[ t = \frac{(223 - 229) - 0}{\sqrt{\dfrac{6.1^2}{30} + \dfrac{5.8^2}{25}}} = -3.731 \]
Step 4: Make a decision

Since the test value t = −3.731 falls in the rejection region, the decision is: “Reject H0”.
Step 5: Summarize the results
There is enough evidence to support the claim that there is a significant difference in cholesterol
levels between the two groups.

3. A. Use P − value method.

Step 1: State the hypothesis.

Step 2: Find the test value. We know that X̄1 = 68.2, n1 = 20, σ1 = 2.5 and
X̄2 = 67.5, n2 = 20, σ2 = 2.8. Therefore,
\[ z = \frac{(68.2 - 67.5) - 0}{\sqrt{\dfrac{2.5^2}{20} + \dfrac{2.8^2}{20}}} = 0.834. \]
Step 3: Compute the P-value.

The area to the right of z = 0.834 is 0.2022. Since it is a right-tailed test, the P-value is 0.2022.
Step 4: Make a decision

Since the P-value is greater than α = 0.05, the decision is “Do not reject H0”.
Step 5: Summarize the results.
There is not enough evidence to support the claim that the athletes are taller than non-athletes.

B. Use confidence interval method.

Step 1: State the hypothesis


H 0 : 1  2
H1 : 1  2 (claim)

Step 2: Find the confidence interval. We know that X̄1 = 68.2, n1 = 20, σ1 = 2.5 and
X̄2 = 67.5, n2 = 20, σ2 = 2.8. For 95% confidence level, we have z_{α/2} = 1.96. Thus, the 95%
confidence interval for μ1 − μ2 is

\[ (68.2 - 67.5) - 1.96\sqrt{\frac{2.5^2}{20} + \frac{2.8^2}{20}} < \mu_1 - \mu_2 < (68.2 - 67.5) + 1.96\sqrt{\frac{2.5^2}{20} + \frac{2.8^2}{20}} \]
−0.945 < μ1 − μ2 < 2.345

Step 3: Make a decision to reject or not reject the null hypothesis.

Since the confidence interval contains the hypothesized value μ1 − μ2 = 0, the decision is: “Do not
reject H0”.

Step 4: Summarize the results.

There is not enough evidence to support the claim that the athletes are taller than non-athletes.

4. Use a traditional method.

Step 1: State the hypothesis.

H0: μ1 = μ2
H1: μ1 ≠ μ2 (claim)

Step 2: Find the critical value.

Although σ1 and σ2 are unknown, n1 ≥ 30 and n2 ≥ 30, so we use the z-test. Since α = 0.02
and the test is two-tailed, find α/2 = 0.02/2 = 0.01. So the areas in the left tail and the right tail
are each 0.01, and the critical values are z = ±2.33.

[Figure: standard normal curve; critical regions z < −2.33 and z > 2.33, acceptance region between them]

Step 3: Compute the test statistic value. We know that X̄1 = 39420, n1 = 35, s1 = 1659 and
X̄2 = 30215, n2 = 40, s2 = 4116. Therefore,
\[ z = \frac{(39420 - 30215) - 0}{\sqrt{\dfrac{1659^2}{35} + \dfrac{4116^2}{40}}} = 12.99 \]

Step 4: Make a decision

Since the test value z = 12.99 falls in the rejection region, the decision is: “Reject H0”.
Step 5: Summarize the results
There is enough evidence to conclude that there is a significant difference between the two states’
chemists’ salaries.

5. Use a traditional method.

Step 1: State the hypothesis.


H 0 : 1  2
H1 : 1  2 (claim)

Step 2: Find the critical value.

Since σ1 and σ2 are unknown and unequal, we use the t-test. Since α = 0.05 and the test is
right-tailed, the area in the right tail is 0.05. Using the t-distribution table with d.f. = 23 and
α = 0.05 (one tail), the critical value is t = 1.714.

[Figure: t curve; critical region t > 1.714, acceptance region to its left]

Step 3: Compute the test statistic value. We know that X̄1 = 48256, n1 = 26, s1 = 3912.40 and
X̄2 = 45633, n2 = 24, s2 = 5533. Therefore,

\[ t = \frac{(48256 - 45633) - 0}{\sqrt{\dfrac{3912.4^2}{26} + \dfrac{5533^2}{24}}} = 1.92. \]

Step 4: Make a decision

Since the test value t = 1.92 falls in the rejection region, the decision is: “Reject H0”.

Step 5: Summarize the results

It can be concluded that the mean of the salaries of the primary school teachers is greater than the
mean of the salaries of the secondary school teachers.



Chapter 12: Correlation and Regression

1. Simple regression has one dependent and one independent variable whereas multiple regression
has one dependent variable and two or more independent variables.

2.

A.
\[ r = \frac{6(4115.025) - (5709)(5.236)}{\sqrt{\left[6(7609557) - (5709)^2\right]\left[6(5.067302) - (5.236)^2\right]}} = -0.833. \]
B.

Step 1: State the hypothesis.


H 0 :   0 (There is no significant relationship between the variables)
H1 :   0 (There is significant relationship between the variables)

Step 2: Find the critical value.

Since $\alpha = 0.05$ and the test is two-tailed, the area in each of the left and right tails is 0.05/2 = 0.025. Using the t-distribution table with $d.f. = n - 2 = 4$ and $\alpha = 0.05$ (or $2p = 0.05$), the critical values are $t = \pm 2.776$. See the diagram below.

[Figure: t-distribution curve with critical regions in both tails, beyond −2.776 and 2.776, and the acceptance region between them.]

Step 3: Compute the test statistic value. We know that $r = -0.833$ and $n = 6$. Therefore,

$$t = r\sqrt{\frac{n - 2}{1 - r^2}} = -0.833\sqrt{\frac{6 - 2}{1 - (-0.833)^2}} = -3.01.$$

Step 4: Make a decision

Since the test value $t = -3.01$ falls in the rejection region, the decision is: “Reject $H_0$”.
Step 5: Summarize the results.
There is a significant relationship between the number of eggs produced and price per dozen.
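
The calculations in parts A and B can be sketched in Python as follows. The sums are the ones substituted into the formula for $r$ in part A; the variable names are hypothetical, and the result is left to be compared with the tabled critical values from Step 2.

```python
import math

# Sums taken from the worked formula in part A (n = 6 data pairs).
n = 6
sum_x, sum_y = 5709, 5.236
sum_xy = 4115.025
sum_x2, sum_y2 = 7609557, 5.067302

# Correlation coefficient from the computational formula.
numerator = n * sum_xy - sum_x * sum_y
denominator = math.sqrt((n * sum_x2 - sum_x**2) * (n * sum_y2 - sum_y**2))
r = numerator / denominator

# t test value for H0: rho = 0, with n - 2 degrees of freedom.
t_value = r * math.sqrt((n - 2) / (1 - r**2))

print(round(r, 3), round(t_value, 2))
```

Because the sketch keeps $r$ unrounded when computing $t$, the last decimal place may differ slightly from a hand calculation that first rounds $r$ to three places.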



C. Since the coefficient of correlation in part B is significant, we can write down the equation of the regression line as follows:

$$a = \frac{(\sum y)(\sum x^2) - (\sum x)(\sum xy)}{n\sum x^2 - (\sum x)^2} = \frac{(5.236)(7609557) - (5709)(4115.025)}{6(7609557) - 5709^2} = 1.252$$

and

$$b = \frac{n\sum xy - (\sum x)(\sum y)}{n\sum x^2 - (\sum x)^2} = \frac{6(4115.025) - (5709)(5.236)}{6(7609557) - 5709^2} = -0.000398.$$

The regression line is

$$y' = 1.252 - 0.000398x.$$

D. The coefficient of determination is $r^2 = (-0.833)^2 = 0.694$. This means that 69.4% of the total variation is explained by the linear regression model.

E. When $x = 1600$ million eggs, the predicted price is $y' = 1.252 - 0.000398(1600) = 0.615$ per dozen.
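
Parts C and E can be checked with a short Python sketch, again using the sums from part A. The names are illustrative, and the coefficients are rounded before predicting only to mirror the hand calculation in the text.

```python
# Sums taken from the worked formula in part A (n = 6 data pairs).
n = 6
sum_x, sum_y = 5709, 5.236
sum_xy = 4115.025
sum_x2 = 7609557

# Both coefficients share the same denominator.
denominator = n * sum_x2 - sum_x**2
intercept = (sum_y * sum_x2 - sum_x * sum_xy) / denominator
slope = (n * sum_xy - sum_x * sum_y) / denominator

# Round as the text does, then predict the price at x = 1600 million eggs.
a = round(intercept, 3)
b = round(slope, 6)
predicted_price = a + b * 1600

print(a, b, round(predicted_price, 3))
```

Predicting with the unrounded coefficients instead would give a value that differs from 0.615 in the third decimal place, which is ordinary rounding error rather than a different model.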

3. When a person is 32 years old, $x_1 = 32$, and has a GPA of 3.4, $x_2 = 3.4$, the predicted income is
$y' = -34127 + 132(32) + 20805(3.4) = 40834$.
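
The substitution in this answer can be written as a small Python function. The signs of the coefficients are an assumption: they are the ones that reproduce the stated result of 40834, and the function name is hypothetical.

```python
def predict_income(age, gpa):
    """Predicted income from the multiple regression equation in this answer."""
    # Assumed coefficients: intercept -34127, age 132, GPA 20805.
    intercept, age_coefficient, gpa_coefficient = -34127, 132, 20805
    return intercept + age_coefficient * age + gpa_coefficient * gpa


predicted = predict_income(age=32, gpa=3.4)
print(round(predicted))
```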

4.
Step 1: State the hypothesis.
$H_0: \rho = 0$ (There is no significant relationship between the variables)
$H_1: \rho \neq 0$ (There is a significant relationship between the variables)

Step 2: Find the critical value.

Since $\alpha = 0.05$ and the test is two-tailed, the area in each of the left and right tails is 0.05/2 = 0.025. Using the t-distribution table with $d.f. = n - 2 = 4$ and $\alpha = 0.05$ (or $2p = 0.05$), the critical values are $t = \pm 2.776$. See the diagram below.

[Figure: t-distribution curve with critical regions in both tails, beyond −2.776 and 2.776, and the acceptance region between them.]

Step 3: Compute the test statistic value. We know that $r = 0.982$ and $n = 6$. Therefore,

$$t = r\sqrt{\frac{n - 2}{1 - r^2}} = 0.982\sqrt{\frac{6 - 2}{1 - 0.982^2}} = 10.4.$$



Step 4: Make a decision

Since the test value $t = 10.4$ falls in the rejection region, the decision is: “Reject $H_0$”.

Step 5: Summarize the results.


There is a significant relationship between the number of cars owned by the companies and their annual income.



Chapter 13: The Chi-Square Tests

1.
Step 1: State the hypothesis.
$H_0$: The number of transactions made on each of the 5 days is the same.
$H_1$: The number of transactions made on each of the 5 days is not the same.

Step 2: Find the critical value.


From the chi-square distribution table, the critical value using $d.f. = k - 1 = 5 - 1 = 4$ and $\alpha = 0.025$ is $\chi^2_{0.025} = 11.143$.

Step 3: Compute the expected frequencies and the test statistic value.

If $H_0$ is true, the expected number of transactions is the same for each of the 5 days:

$$E = \frac{\text{Total number of transactions made}}{\text{Number of days}} = \frac{1200}{5} = 240.$$

Observed frequency (O)    Expected frequency (E)    $(O - E)^2$    $(O - E)^2 / E$
253                       240                       169            0.704
197                       240                       1849           7.704
204                       240                       1296           5.400
279                       240                       1521           6.338
267                       240                       729            3.038
Total                                                              23.183

Therefore, the test value is $\chi^2 = \sum \dfrac{(O - E)^2}{E} = 23.183$.

Step 4: Make a decision and summarize the results.

Since the test value $\chi^2 = 23.183$ falls in the rejection region ($23.183 > 11.143$), the decision is: “Reject $H_0$”.
Therefore, we conclude that the number of transactions made using this ATM is not the same for each of the 5 days.
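
The goodness-of-fit calculation above can be sketched in Python as follows. The observed counts come from the table in Step 3, the critical value is copied from Step 2 rather than computed, and the variable names are illustrative.

```python
# Observed number of transactions on each of the 5 days.
observed = [253, 197, 204, 279, 267]

# Under H0 every day has the same expected frequency.
expected = sum(observed) / len(observed)

# Chi-square test value: sum of (O - E)^2 / E over all categories.
chi_square = sum((o - expected) ** 2 / expected for o in observed)

# Critical value from the chi-square table (d.f. = 4, alpha = 0.025).
critical_value = 11.143
reject_h0 = chi_square > critical_value

print(expected, round(chi_square, 3), reject_h0)
```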

2.
Step 1: State the hypothesis.

$H_0$: The two attributes, gender and opinions of adults, are independent.
$H_1$: The two attributes, gender and opinions of adults, are dependent.



Step 2: Find the critical value.

From the chi-square distribution table, the critical value using $d.f. = (r - 1)(c - 1) = (2 - 1)(3 - 1) = 2$ and $\alpha = 0.01$ is $\chi^2_{0.01} = 9.210$.

Step 3: Computation of Expected Frequency and Test Statistics Value:

The expected frequency is computed by:

$$E = \frac{(\text{Row total})(\text{Column total})}{\text{Sample size}}$$

Observed frequency (O)    Expected frequency (E)     $(O - E)^2$    $(O - E)^2 / E$
93                        (175)(180)/300 = 105       144            1.37
70                        (175)(102)/300 = 59.5      110.25         1.85
12                        (175)(18)/300 = 10.5       2.25           0.21
87                        (125)(180)/300 = 75        144            1.92
32                        (125)(102)/300 = 42.5      110.25         2.59
6                         (125)(18)/300 = 7.5        2.25           0.30
Total                                                               8.24

Therefore, the test value is $\chi^2 = \sum \dfrac{(O - E)^2}{E} = 8.24$.

Step 4: Make a decision and summarize the results.

Since the test value $\chi^2 = 8.24$ falls in the acceptance region ($8.24 < 9.210$), the decision is: “Do not reject $H_0$”.
Therefore, there is not enough evidence to conclude that the two attributes, gender and opinions of adults, are dependent.
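
The test of independence above can be sketched in Python as follows. The observed counts are arranged as a 2 × 3 table (the assignment of rows to gender and columns to opinion is an assumption about the layout), and the critical value is copied from Step 2.

```python
# Observed frequencies: one row per gender, one column per opinion category.
observed = [
    [93, 70, 12],
    [87, 32, 6],
]

row_totals = [sum(row) for row in observed]
column_totals = [sum(column) for column in zip(*observed)]
sample_size = sum(row_totals)

# Expected frequency for each cell: (row total)(column total) / sample size.
expected = [
    [row_total * column_total / sample_size for column_total in column_totals]
    for row_total in row_totals
]

# Chi-square test value summed over all cells.
chi_square = sum(
    (o - e) ** 2 / e
    for observed_row, expected_row in zip(observed, expected)
    for o, e in zip(observed_row, expected_row)
)

# Critical value from the chi-square table (d.f. = 2, alpha = 0.01).
critical_value = 9.210
reject_h0 = chi_square > critical_value

print(expected, round(chi_square, 2), reject_h0)
```

Without intermediate rounding the sum comes to about 8.25; the 8.24 in the table results from rounding each term to two decimals before adding. The decision is the same either way.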



Chapter 14: Analysis of Variance

1.
Step 1: State the hypothesis and identify the claim.
$H_0: \mu_1 = \mu_2 = \mu_3$
H1 : At least one mean is different from others. (claim)

Step 2: Find the critical value. Since $k = 3$ and $N = 22$,

$d.f.N. = k - 1 = 3 - 1 = 2$
$d.f.D. = N - k = 22 - 3 = 19$

The critical value is 3.5219, obtained from the F-distribution table with $\alpha = 0.05$.

[Figure: F-distribution curve starting at 0, with the rejection region in the right tail beyond the critical value 3.5219.]

Step 3: Compute the test value.

The data, sample size, mean and variance of each group:

Condiments              Cereals                 Desserts
270                     260                     100
130                     220                     180
230                     290                     250
180                     290                     250
80                      200                     300
70                      320                     360
200                     140                     300
                                                160

$n_1 = 7$               $n_2 = 7$               $n_3 = 8$
$\bar{X}_1 = 165.714$   $\bar{X}_2 = 245.714$   $\bar{X}_3 = 237.5$
$s_1^2 = 5695.238$      $s_2^2 = 3928.571$      $s_3^2 = 7335.714$

The grand mean:

$$\bar{X}_{GM} = \frac{\sum X}{N} = \frac{4780}{22} = 217.273.$$



Between-group variance:

$$s_B^2 = \frac{\sum n_i(\bar{X}_i - \bar{X}_{GM})^2}{k - 1} = \frac{7(165.714 - 217.273)^2 + 7(245.714 - 217.273)^2 + 8(237.5 - 217.273)^2}{2} = 13771.799$$

Within-group variance:

$$s_W^2 = \frac{\sum (n_i - 1)s_i^2}{N - k} = \frac{6(5695.238) + 6(3928.571) + 7(7335.714)}{6 + 6 + 7} = 5741.729$$

Therefore,

$$F = \frac{s_B^2}{s_W^2} = \frac{13771.799}{5741.729} = 2.3985.$$

Step 4: Since the test value $F = 2.3985$ lies in the acceptance region ($2.3985 < 3.5219$), the decision is: “Do not reject $H_0$”.

Step 5: There is not enough evidence to support the claim that a difference in mean sodium amounts exists among condiments, cereals, and desserts.
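
The one-way ANOVA above can be reproduced from the raw data with the Python sketch below, which uses only the standard library. The critical value is copied from Step 2, and the variable names are illustrative.

```python
from statistics import mean, variance

# Sodium amounts for each food group, as listed in Step 3.
groups = {
    "Condiments": [270, 130, 230, 180, 80, 70, 200],
    "Cereals": [260, 220, 290, 290, 200, 320, 140],
    "Desserts": [100, 180, 250, 250, 300, 360, 300, 160],
}

k = len(groups)
N = sum(len(values) for values in groups.values())
grand_mean = sum(sum(values) for values in groups.values()) / N

# Between-group variance: weighted squared deviations of group means, over k - 1.
between_variance = sum(
    len(values) * (mean(values) - grand_mean) ** 2 for values in groups.values()
) / (k - 1)

# Within-group variance: pooled sample variances, over N - k.
within_variance = sum(
    (len(values) - 1) * variance(values) for values in groups.values()
) / (N - k)

f_value = between_variance / within_variance

# Critical value from the F table (d.f.N. = 2, d.f.D. = 19, alpha = 0.05).
critical_value = 3.5219
reject_h0 = f_value > critical_value

print(round(grand_mean, 3), round(f_value, 4), reject_h0)
```

Because the sketch does not round the group means, the between-group variance may differ from the hand-computed 13771.799 in the first decimal place; the F value and the decision are unaffected at the precision reported.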

2. The two-way ANOVA allows the researcher to test the effects of two independent variables and a possible interaction effect. The one-way ANOVA can test the effect of only one independent variable.

3.
Step 1: State the hypothesis

Hypotheses for subcontractors:

$H_0$: There is no difference between the mean number of days taken by the two subcontractors to build.
$H_1$: There is a difference between the mean number of days taken by the two subcontractors to build.

Hypotheses for home type:

$H_0$: There is no difference between the mean number of days taken to build the three types of home.
$H_1$: There is a difference between the mean number of days taken to build the three types of home.

Hypotheses for interaction effect:

$H_0$: There is no interaction effect between the home type and the subcontractor on the days to build.
$H_1$: There is an interaction effect between the home type and the subcontractor on the days to build.



Step 2: Find the critical values for each F-test. Factor A is the subcontractor and it has two levels (A and B), so $a = 2$. Factor B is the type of home and it has three levels (I, II, and III), so $b = 3$. The number of data values in each group is 5, so $n = 5$. The degrees of freedom are as follows:

Subcontractor: $d.f.N. = a - 1 = 2 - 1 = 1$.
Home type: $d.f.N. = b - 1 = 3 - 1 = 2$.
Interaction: $d.f.N. = (a - 1)(b - 1) = (2 - 1)(3 - 1) = 2$.
Error: $d.f.D. = ab(n - 1) = 2(3)(5 - 1) = 24$.

So the critical values are as follows:

Subcontractor: Using $\alpha = 0.05$, $d.f.N. = 1$, and $d.f.D. = 24$, we get 4.2597.
Home type: Using $\alpha = 0.05$, $d.f.N. = 2$, and $d.f.D. = 24$, we get 3.4028.
Interaction: Using $\alpha = 0.05$, $d.f.N. = 2$, and $d.f.D. = 24$, we get 3.4028.

Step 3: Complete the ANOVA table and compute the test values.

The complete ANOVA table is:

Source            SS          d.f.    MS          F
Subcontractor     1672.553    1       1672.553    122.084
Home type         444.867     2       222.4335    16.236
Interaction       313.267     2       156.6335    11.433
Within (error)    328.800     24      13.7
Total             2759.487    29

The test values are as follows:

Subcontractor: $F = 122.084$
Home type: $F = 16.236$
Interaction: $F = 11.433$

Step 4: Reject or do not reject the null hypothesis and state the conclusion.

Subcontractor:
Since the test value $F = 122.084$ falls in the rejection region, we reject the null hypothesis and conclude that there is a difference between the mean number of days taken by the two subcontractors to build.

Home type:
Since the test value $F = 16.236$ falls in the rejection region, we reject the null hypothesis and conclude that there is a difference between the mean number of days taken to build the three types of home.

Interaction:
Since the test value $F = 11.433$ falls in the rejection region, we reject the null hypothesis and conclude that there is an interaction effect between the home type and the subcontractor on the days to build.
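
The mean squares, F values, and decisions in this answer follow mechanically from the sums of squares, as the Python sketch below illustrates. The sums of squares and critical values are copied from the answer (they are not recomputed from raw data, which is not shown here), and the names are illustrative.

```python
# Design: a subcontractors, b home types, n observations per cell.
a, b, n = 2, 3, 5

# Sums of squares copied from the ANOVA table in Step 3.
sums_of_squares = {
    "Subcontractor": 1672.553,
    "Home type": 444.867,
    "Interaction": 313.267,
    "Within (error)": 328.800,
}

degrees_of_freedom = {
    "Subcontractor": a - 1,
    "Home type": b - 1,
    "Interaction": (a - 1) * (b - 1),
    "Within (error)": a * b * (n - 1),
}

# Critical values copied from Step 2 (alpha = 0.05, d.f.D. = 24).
critical_values = {"Subcontractor": 4.2597, "Home type": 3.4028, "Interaction": 3.4028}

# Mean square = SS / d.f.; each F value divides by the within (error) mean square.
mean_squares = {
    source: sums_of_squares[source] / degrees_of_freedom[source]
    for source in sums_of_squares
}
ms_within = mean_squares["Within (error)"]
f_values = {source: mean_squares[source] / ms_within for source in critical_values}

for source, critical in critical_values.items():
    decision = "reject H0" if f_values[source] > critical else "do not reject H0"
    print(f"{source}: F = {f_values[source]:.3f}, critical = {critical}, {decision}")
```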
