© 2018 The University of the South Pacific (USP). Except where otherwise noted, this work is licensed
under Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. To view a
copy of this license, visit https://creativecommons.org/licenses/by-nc-sa/4.0/.
This work was carried out with the aid of a grant from the Office of the Deputy Vice-Chancellor Learning,
Teaching and Student Services (LTSS), USP as part of the Open Educational Resources (OER) Course
Conversion project.
Disclaimer
“The publication is released for educational purposes, and all information provided is in an ‘as is’ basis.
Although the author and publisher have made every effort to ensure that the information in this publication
was correct at the time of going to press, the author and publisher do not assume and hereby disclaim
any liability to any party for any loss, damage, or disruption caused by errors or omissions, whether such
errors or omissions result from negligence, accident, or any other cause. Any views expressed in the
publication are that of the author, and do not necessarily reflect the views of The University of the South
Pacific. All products and services mentioned are owned by their respective copyright holders, and mere
presentation in the publication does not mean endorsement by The University of the South Pacific.
Derivatives of this work are not authorized to use the logo of The University of the South Pacific.”
Preface
This book entitled “Basic Statistics - A Step By Step Approach” is designed to be used in a basic statistics
course. It introduces students to basic concepts in statistics using a step by step approach and will be a
very handy resource for a first course in statistics. The book also includes lots of examples and exercises
with solutions to help students understand concepts better. This book has fourteen chapters and an
appendix.
This book makes reference to the Eton Statistical and Maths Tables (4th Edition) published by Pearson
New Zealand.
INTRODUCTION TO STATISTICS
Objectives
After completing this chapter, you should be able to:
1. Define some statistical terms.
2. Differentiate between descriptive and inferential statistics.
3. Identify types of variables.
4. Identify the measurement levels for each variable.
5. Identify the sampling technique used.
6. Differentiate between an observational and an experimental study.
1.1 Introduction
This chapter provides an introduction to statistics. It explains the basic terms and concepts such as
statistics; branches of statistics; types of variables; techniques to collect data; sampling techniques;
observational and experimental studies.
The word statistics is used to mean two different things. One definition is that statistics
are numbers measured for some purpose. A more complete definition is the following:
Statistics can be defined as the science of conducting studies to collect, organize, summarize,
analyze and draw meaningful conclusions from data.
EXAMPLE 1−1
To describe the two branches of statistics, it is useful to know the definition of the following statistical
terms:
A variable is a characteristic or attribute that can assume different values. Variables whose
values are determined by chance are called random variables.
Data are the values (measurement or observations) that the variables can assume. A collection
of data values forms a data set. Each value in the data set is called a data value or a datum.
A population consists of all subjects that are being studied. A sample is a subset of the
population.
A statistic is a characteristic or a measure obtained by using the data values from a sample.
A parameter is a characteristic or a measure obtained by using all the data values from a
population.
A survey that includes every member of the population is called a census. The technique of
collecting information from a portion of the population is called a sample survey.
EXAMPLE 1−2
EXAMPLE 1−3
An ANZ Bank Manager with 12,000 customers does a survey to gauge customer views on Internet
Banking, which would incur lower bank fees. In the survey, 21% of the 300 customers interviewed said that
they were interested in Internet Banking.
A. What is the population of interest?
B. What is the sample?
C. Is the value 21% a parameter or a statistic?
A data set in its original form is called raw data and is usually very large. Consequently, such a data set
is not very helpful in making conclusions or decisions. It is easier to draw conclusions from summary
tables and diagrams than from the raw data. So, we reduce the raw data by constructing frequency tables,
drawing graphs, or calculating summary measures such as mean and standard deviations. The portion
of statistics that deals with this type of statistical analysis is called descriptive statistics. For example,
consider the national census conducted by the Fiji Government. Results of this census give the average age,
income and other characteristics of the Fiji population. It is an example of descriptive statistics because
data from the entire population are collected and summarized.
EXAMPLE 1−4
In each of the following statements, tell whether descriptive or inferential statistics have been used.
A. In the year 2020, 20000 students will be enrolled at USP.
B. Income for the cane farmers in Fiji was 1.2 million in 2017.
C. Research stated that the shape of a person’s ears is related to the person’s aggression.
D. The national average annual medicine expenditure per person is $1052.
SOLUTION A. Inferential
B. Descriptive
C. Inferential
D. Descriptive
Qualitative variables are those that cannot be assigned numerical values but are placed into distinct
categories determined by some attribute or characteristic. For example, gender can be categorized as
either male or female. Colour, religion and geographical location are other examples. Quantitative
variables are variables that take numerical values and hence can be ranked or ordered. For example,
temperature can have any numerical value and be ordered from highest to lowest or vice versa.
Other examples include age, height, weight and volume.
Since quantitative variables can assume any numerical value, it is important to further categorize them
into discrete and continuous variables.
We have seen the classification of variables into qualitative and quantitative variables. We now look at
how the variables can be classified by how they are categorized, counted or measured and for this we
use the levels of measurement. There are four levels of measurement: nominal, ordinal, interval, and
ratio. These go from lowest level to highest level. Data is classified according to the highest level which
it fits. Each additional level adds something the previous level didn't have.
Nominal level of measurement is used to describe qualitative variables, which cannot be assigned
numerical values and hence cannot be ordered. Examples whereby nominal measurement may be
applied include subject areas of study (Mathematics, Algebra, Statistics, Language, etc.) or colours (blue,
red, green, etc.)
The nominal level of measurement classifies data into mutually exclusive (non-overlapping),
exhaustive categories in which no order or ranking can be imposed on the data.
Ordinal level of measurement describes qualitative data as well, but unlike nominal level of measurement,
it allows categorization that can be sorted or ranked. It is important to note, however, that precise
differences between the ranks do not exist. Examples include the grade letters (A, B, C, D, E, and F) and
positions achieved in a marathon (first, second, third, etc.)
The ordinal level of measurement classifies data into categories that can be ranked or ordered, but
precise differences do not exist between these categories.
The interval level of measurement differs from ordinal in the sense that precise differences do exist
between data values. For example, temperature in °C can be ranked, and there exists a precise difference
between any two values (2 degrees between 19°C and 21°C). However, no meaningful zero exists:
0°C does not mean no heat at all. Likewise, an IQ score of
0 does not mean the subject’s intelligence is zero.
The interval level of measurement ranks data, and precise differences between units of measure do
exist. However, there is no meaningful zero.
The ratio level of measurement has all properties of interval level but has a meaningful zero. Examples
include height, weight, salary, etc. A true ratio also exists between two measurements of the population.
The ratio level of measurement possesses all characteristics of the interval level of measurement
but there exists a meaningful zero.
Indicate which of the following variables are quantitative and which are qualitative. Classify the
quantitative variables as discrete or continuous and classify the qualitative variables as nominal or
ordinal.
A. Number of road accidents in a year.
B. The time a student takes to walk to school.
C. Religion of people in Fiji.
D. Length of jump by athletes in long jump event.
E. Number of errors on each page of a book.
F. Grades of students at USP (A+, A, B+, B, etc.).
G. Shoe size of a person.
H. Education level of a sugarcane farmer.
SOLUTION A. Quantitative because the variable is numerical and discrete because the variable
is countable.
B. Quantitative because the variable is numerical and continuous because the
variable is measured.
C. Qualitative because the variable is categorical and nominal because the variable
has no order or ranking.
D. Quantitative because the variable is numerical and continuous because the
variable is measured.
E. Quantitative because the variable is numerical and discrete because the variable
is countable.
F. Qualitative because the variable is categorical and ordinal because the variable
has order or ranking.
G. Quantitative because the variable is numerical and continuous because the
variable is measured.
H. Qualitative because the variable is categorical and ordinal because the variable
has order or ranking.
EXAMPLE 1−6
Classify each of the following attributes as either categorical or numerical. For those that are numerical,
determine whether they are ratio or interval and for those that are categorical, determine whether they
are nominal or ordinal.
Disadvantages Costs more than the other two methods; interviewer may be
biased in the selection of subjects or could even be
unknowingly influencing the responses of the interviewee.
2. Telephone Interviews This is an interview through phone where the researcher asks
a standard set of questions.
Advantage Costs less than personal interview; subjects tend to be more
candid in their opinions.
Data can also be collected in other ways, such as surveying records or direct observation of situations.
Since some populations being studied may be too large for descriptive statistics to be applied (i.e. collect
data about each and every individual subject), inferential statistics is applied instead. Therefore, samples
must be selected from the population very carefully and evenly to obtain the best applicable data. The
sampling techniques mainly used are random, systematic, stratified and cluster sampling.
Random sampling selects subjects by using chance methods or random numbers. E.g. numbering each
subject in the population, placing the numbered cards in a bowl/box/hat, then randomly selecting the
required number of cards from the bowl/box/hat. In practice, statisticians use random number tables instead.
Systematic sampling numbers each subject of the population and then selects every kth
subject. The first member of the sample, however, is selected at random. E.g. a sample of 50 is
needed from a population of size 2000; since 2000 ÷ 50 = 40, every 40th subject would be selected after
the first subject is randomly selected.
Stratified sampling divides the population into groups called strata according to some attribute or
characteristic important to the study, and then samples are drawn from each group. Samples drawn from
the strata are randomly selected. E.g. in a study to determine obesity in the population, subjects
may be divided into groups by gender, age group or ethnicity.
Cluster sampling divides the population into groups called clusters by some means such as geographical
location, schools or city/suburb. Then, some of the clusters are randomly selected and all subjects
from these clusters are used in the study. This sampling technique is normally used when the population size
is very large or when the population is distributed across a large geographical area. This method is also
cost-effective. E.g. to study the eating habits of Fijians, certain villages or settlements may be randomly
selected and all individuals from those villages used in the study.
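The four sampling techniques described above can be sketched in Python. This is only an illustration under stated assumptions: the population of 2000 numbered subjects and the sample size of 50 come from the systematic sampling example, while the strata (odd and even numbers standing in for an attribute such as gender) and the clusters (blocks of 100 subjects standing in for villages) are hypothetical.

```python
import random

# Population of 2000 numbered subjects (sizes from the systematic example).
population = list(range(1, 2001))
rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Random sampling: every subject has the same chance of being selected.
random_sample = rng.sample(population, 50)

# Systematic sampling: k = 2000 / 50 = 40. Pick a random start among the
# first k subjects, then take every k-th subject after it.
k = len(population) // 50
start = rng.randrange(k)
systematic_sample = population[start::k]

# Stratified sampling: split into strata, then sample randomly within each.
# Assumption: odd/even numbers stand in for a real attribute such as gender.
strata = {
    "odd": [s for s in population if s % 2 == 1],
    "even": [s for s in population if s % 2 == 0],
}
stratified_sample = [s for group in strata.values() for s in rng.sample(group, 25)]

# Cluster sampling: split into clusters, randomly choose some clusters and
# use every subject in them. Assumption: blocks of 100 stand in for villages.
clusters = [population[i:i + 100] for i in range(0, len(population), 100)]
chosen_clusters = rng.sample(clusters, 2)
cluster_sample = [s for cluster in chosen_clusters for s in cluster]

print(len(random_sample), len(systematic_sample),
      len(stratified_sample), len(cluster_sample))
```

Note that the stratified sample draws from every group, whereas the cluster sample uses everyone from only a few groups.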
EXAMPLE 1−7
SOLUTION
A. Cluster
B. Systematic
C. Random
D. Stratified
Advantages:
Usually occur in natural settings; they can be carried out in situations where it would be unethical or
downright dangerous for a researcher to conduct an experiment; can be carried out using variables that
cannot be manipulated by the researcher.
Disadvantages:
The researcher does not control variables; the data of other variables that have significant influences on
outcome variable may not be collected; can be expensive and time-consuming; and there are no
guarantees on the accuracy of the collected data.
For example, determining the effect of beauty products on the skin. Here, the beauty product is an
independent variable.
Advantages:
Can decide on how to select subjects; can decide on how to assign them to specific groups; control or
manipulate the independent variable.
Disadvantages:
Results may occur in unnatural settings; the behaviours of the participants in the study may be changed
because they knew beforehand that they would participate in the study (this is known as the Hawthorne effect);
presence of other variables (confounding variables) that the researcher did not choose but that influence
the outcome variable.
An independent variable in an experimental study is the one that is being manipulated by the
researcher. It is also called the explanatory variable. The resultant variable is called the dependent
variable or the outcome variable.
EXAMPLE 1−8
SOLUTION A. experimental
B. observational
C. observational
1.9 Summary
This chapter introduced statistics. We have studied basic terms and concepts such as what is statistics;
why study statistics; variable; population and sample; statistic and parameter; census and sample survey;
descriptive and inferential statistics; the types of variable i.e. quantitative/qualitative, discrete/continuous,
nominal/ordinal/ratio/interval; techniques to collect data; the sampling techniques i.e.
simple/systematic/stratified/cluster; observational and experimental studies. This chapter will further help
readers understand the rest of the chapters better.
EXERCISES
2. A study of ST130 students in 2016 was undertaken to compare the average number of tutorial
sessions a student missed in 2016 with the previous year’s average of 3 sessions. A random sample
of 35 students was surveyed and it was found that the mean number of missed sessions for the 35
students was 2. Answer the following questions:
A. Interviewing every 5th customer leaving a theatre about the movie they had seen.
B. The country is divided into economic classes and a sample is chosen from each class to be
surveyed.
C. A researcher divided subjects into 4 geographical groups and then selected all members from
a randomly selected group as samples.
D. A Maths tutor at USP is interested in the mean number of days an ST130 student is absent
from tutorial classes. The tutor takes her sample by gathering data on 5 randomly selected
students from the ST130 course.
E. Questioning every 14th customer leaving a theatre about the movie they had seen.
5. When running an experimental study, the group that is manipulated can be called the treatment
group. True or False.
7. For each of the following, state whether the variable is continuous or discrete:
A. A researcher on the busy street of Suva City asking random people that pass by how many pets
they have, then taking this data and using it to decide if there should be more pet food stores in
that area.
B. A researcher trying to determine the effects that eating strictly organic foods has on overall
health. The researcher finds 200 individuals, where 100 of them have eaten organically for the
past three years, and the other 100 have not eaten organically in the past three years.
C. A researcher trying to study the relation between internet access and the exam scores of
students. To do this, the students were randomly assigned to two groups, and only one group
was given the access. After 4 months, the exam scores of the two groups were compared.
FREQUENCY DISTRIBUTION
AND GRAPHS
Objectives
After completing this chapter, you should be able to:
1. Organize data using frequency distribution tables.
2. Represent qualitative data graphically using bar graphs, Pareto charts, time series graphs and
pie charts.
3. Represent quantitative data graphically using histograms, frequency polygons and ogives.
4. Identify shape of frequency distributions.
5. Draw and interpret a stem and leaf plot.
2.1 Introduction
When conducting statistical studies, researchers collect data for a particular variable under study. For
example, if a researcher wishes to study the number of people who were infected with tuberculosis in
Suva over the past two years, he/she has to collect data from various doctors, hospitals and health
departments. In Chapter 1, we have learned some techniques the researchers can use to collect data.
Data that has not been processed for use and is still in its original form is called raw data (sometimes
called source data or atomic data).
Since little information can be obtained from looking at the raw data, the researcher organizes the raw
data in some meaningful way and the most convenient method of organizing data is to construct a
frequency distribution.
After organizing the data, the researcher must present them in such a way that could be understood by
those who will benefit from the study. The most useful method of presenting the data is by constructing
statistical charts and graphs. There are many different types of charts and graphs, and each one has a
specific purpose. In Chapter 2, you will learn the statistical methods of organizing and presenting data.
There are three types of frequency distributions: categorical (qualitative data), ungrouped (quantitative
data) and grouped (quantitative data).
A frequency distribution is the organisation of raw data in table form using classes and frequencies.
A categorical frequency distribution lists all categories and the number of elements that belong to
each of the categories.
EXAMPLE 2−1
A sample of 30 children from a primary school was selected, and was asked what their favourite fruit was.
They were given 3 options: apples, oranges and bananas. Their response is given below:
SOLUTION
Step 1: Choose the categories/classes for the distribution. Since there are 3 options: apples, oranges,
and bananas, these will be the categories/classes.
Step 2: Make a table as shown:
Step 3: Tally the data and put the results in the tally column.
Step 4: Count the tallies and place the results in the frequency column.
Step 5: Find the total for the frequency column. The completed table is shown.
Σf = 30
It is easier to gather information from the categorical frequency distribution than the raw data. That is, it
can be concluded that
Most children prefer banana.
20 children prefer banana or orange.
The relative frequency and percentage distributions of Example 2-1 are given below:
Fruit     Frequency (f)    Relative frequency    Percentage (%)
Apple     10               10/30 = 0.33          33
Banana    11               11/30 = 0.37          37
Orange    9                9/30 = 0.30           30
Total     Σf = 30          1                     100
From the relative frequency and the percentage distribution, the following information can be obtained:
The relative frequency of apple is 0.33, which means that 33% of the children prefer apple.
70% of the children prefer banana or apple.
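As a minimal sketch, the categorical frequency distribution with its relative frequencies and percentages could be built in Python as follows. The individual responses are not listed here, so the list is rebuilt from the counts; the apple count of 10 is an assumption inferred from the total of 30 and the stated relative frequency of 0.33.

```python
from collections import Counter

# Responses rebuilt from the table counts (banana 11, orange 9).
# Assumption: apple = 10, inferred from the total of 30.
responses = ["banana"] * 11 + ["orange"] * 9 + ["apple"] * 10

frequency = Counter(responses)   # tally each category
n = sum(frequency.values())      # total number of responses

for category, f in frequency.most_common():
    relative = f / n             # relative frequency = f / n
    print(f"{category:<8}{f:>4}{relative:>8.2f}{relative * 100:>6.0f}%")
```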
An ungrouped frequency distribution lists each distinct data value and the number of times that
value occurs in the data set.
The following example illustrates how an ungrouped frequency distribution table is constructed.
A group of 24 customers of a popular restaurant were asked to review the quality of service.
They had to rate the service provided by the restaurant on a scale of 1−10. Below are their ratings:
10 6 7 8 4 1 7 6
9 10 8 2 3 3 6 5
1 4 5 7 6 10 9 6
Construct a frequency distribution table for these data.
SOLUTION
Choose the classes for the distribution. Since the rating is on the scale 1−10, these will be the classes.
The procedure for constructing an ungrouped frequency distribution is the same as for a categorical frequency
distribution. The complete ungrouped frequency distribution table is shown below with the relative
frequencies and percentages.
Total Σf = 24 1 100
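The ungrouped frequency distribution for the 24 ratings can be reproduced with a short Python sketch. The variable names are illustrative only; the data are the ratings listed in the example.

```python
from collections import Counter

ratings = [10, 6, 7, 8, 4, 1, 7, 6,
           9, 10, 8, 2, 3, 3, 6, 5,
           1, 4, 5, 7, 6, 10, 9, 6]

frequency = Counter(ratings)  # how many times each rating occurs
n = len(ratings)

# One row per rating on the 1-10 scale: (rating, frequency, relative frequency)
table = [(x, frequency[x], frequency[x] / n) for x in range(1, 11)]

for x, f, rel in table:
    print(f"{x:>6}{f:>4}{rel:>7.2f}{rel * 100:>6.0f}")
```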
A grouped frequency distribution organizes numerical data where the raw data is grouped using
class intervals of equal width.
To give an example of a grouped frequency distribution, let us consider the weights (in kg) of 50 pieces
of luggage with class intervals as follows:
Weight (kg)    Class boundaries    No. of pieces
7− 9 6.5 − 9.5 2
13 − 15 12.5 – 15.5 14
16 – 18 15.5 – 18.5 19
19 − 21 18.5 – 21.5 7
Total 50
The class boundaries are used to separate the classes so that there are no gaps in the frequency
distribution.
EXAMPLE 2−3
Peter picked 40 leaves from a mango tree and measured their lengths in centimetres. He collected the
following data:
19, 16, 13, 17, 7, 8, 4, 18, 10, 17, 18, 9, 12, 5, 9, 9, 16, 1, 8, 17
1, 10, 5, 9, 11, 15, 6, 14, 9, 17, 1, 12, 5, 16, 4, 16, 8, 15, 14, 17
SOLUTION
Total 40
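A grouped frequency distribution for the 40 leaf lengths can be sketched in Python. The class limits used in the worked solution are not shown here, so the class width of 5 (classes 1−5, 6−10, 11−15 and 16−20) is an assumption made for illustration only.

```python
lengths = [19, 16, 13, 17, 7, 8, 4, 18, 10, 17, 18, 9, 12, 5, 9, 9, 16, 1, 8, 17,
           1, 10, 5, 9, 11, 15, 6, 14, 9, 17, 1, 12, 5, 16, 4, 16, 8, 15, 14, 17]

class_width = 5                           # assumed width, not from the text
lower_limits = range(1, 21, class_width)  # 1, 6, 11, 16

table = []
for lower in lower_limits:
    upper = lower + class_width - 1
    # Count the data values that fall inside the class limits.
    f = sum(lower <= x <= upper for x in lengths)
    table.append((lower, upper, f))

for lower, upper, f in table:
    print(f"{lower:>2} - {upper:<2} {f:>3}")
print("Total", sum(f for _, _, f in table))
```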
EXAMPLE 2−4
The table provides the distribution of the ages of new employees who joined a factory.
Age (class interval)    Frequency (f)
20 − 29                 7
30 − 39                 21
40 − 49                 4
50 − 59                 2
60 − 69                 1
A. Obtain the class boundaries and class marks of the class intervals.
B. What is the upper class limit of the class 30 – 39?
C. What is the lower class boundary of the class 50 – 59?
D. What is the class mark of the class 40 – 49?
A. The class boundaries and class marks are given in the following table:
Class interval Class boundary Class mark ( xm ) Frequency ( f )
20 − 29 19.5 – 29.5 24.5 7
30 − 39 29.5 – 39.5 34.5 21
40 − 49 39.5 − 49.5 44.5 4
50 − 59 49.5 – 59.5 54.5 2
60 − 69 59.5 – 69.5 64.5 1
B. 39
C. 49.5
D. 44.5
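The class boundaries and class marks in this solution follow a simple rule that can be sketched in Python. The function names are hypothetical; the rule assumes class limits are whole numbers, so the gap between one class's upper limit and the next class's lower limit is 1.

```python
class_limits = [(20, 29), (30, 39), (40, 49), (50, 59), (60, 69)]

def class_boundaries(lower, upper, gap=1):
    """Extend each limit by half the gap between consecutive classes."""
    return lower - gap / 2, upper + gap / 2

def class_mark(lower, upper):
    """The class mark (midpoint) is the average of the two class limits."""
    return (lower + upper) / 2

for lower, upper in class_limits:
    low_b, up_b = class_boundaries(lower, upper)
    print(f"{lower} - {upper}: boundaries {low_b} - {up_b}, "
          f"mark {class_mark(lower, upper)}")
```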
A bar graph represents the data by using vertical or horizontal bars whose heights or lengths represent the
frequency of the respective categories.
EXAMPLE 2−5
The given data represents the average amount of money spent by first year college students. Construct
a bar graph for the data.
Food $765
Clothing $443
Step 1: Draw and label the x and y axis. For the vertical bar graph, place the frequency scale on the y
axis.
[Figure: Bar Graph. x-axis: Type of spending (Food, Clothing, Text Books, Technical Gadgets); y-axis: Amount.]
A Pareto chart is used to present categorical data; the frequencies are displayed by the heights of
vertical bars, which are arranged in order from highest to lowest.
EXAMPLE 2−6
SOLUTION
Step 1: Arrange the data from largest to smallest according to the frequency.
Food $765
Clothing $443
[Figure: Pareto Chart. x-axis: Type of Spending (Technical Gadgets, Food, Text Books, Clothing); y-axis: Amount.]
A time series graph represents data that occur over a specific period of time.
EXAMPLE 2−7
The data below shows the number of athletes participating in a five-day athletics tournament organized
by the Oceania Sports Council. Construct a time series graph.
Tuesday 14
Wednesday 22
Thursday 36
Friday 43
A pie chart is a circle that is divided into sections or wedges according to the percentage of
frequencies in each category of the distribution.
EXAMPLE 2−8
This frequency distribution shows the drink preferences of people at a cocktail party. Construct a pie
chart for the data.
Response Frequency
Red Wine 77
Whiskey 48
Tribe 65
Total 190
SOLUTION
Step 1: Since there are 360° in a circle, the frequency for each class must be converted to degrees.
$\text{Degrees} = \frac{f}{n} \times 360^\circ$
Step 2: Each frequency must also be converted to a percentage.
$\text{Percentage} = \frac{f}{n} \times 100\%$
[Figure: Pie Chart. Red Wine 41%, Whiskey 25%, Tribe 34%.]
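The two conversions used for the pie chart (frequency to degrees and frequency to percentage) can be checked with a minimal Python sketch using the frequencies from the table; the variable names are illustrative.

```python
drinks = {"Red Wine": 77, "Whiskey": 48, "Tribe": 65}
n = sum(drinks.values())  # total number of responses

# Degrees = (f / n) * 360 and Percentage = (f / n) * 100 for each category
degrees = {drink: f / n * 360 for drink, f in drinks.items()}
percentages = {drink: f / n * 100 for drink, f in drinks.items()}

for drink in drinks:
    print(f"{drink:<9}{degrees[drink]:>7.1f} degrees{percentages[drink]:>6.0f}%")
```

The angles add up to 360° and the rounded percentages match those on the chart.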
We will now look at the graphical presentation of quantitative (numerical) data. Some of the
graphs/charts by which we can present quantitative data are:
Histograms
Frequency polygons
Ogive or cumulative frequency graphs
2.3.5 Histograms
A histogram is the most commonly used graph to represent quantitative data. The horizontal axis (x-axis)
represents the data (or class boundaries) and the vertical axis (y-axis) represents the frequency.
A histogram is a graph that displays the data by using contiguous vertical bars of various heights to
represent the frequencies of the class.
The data below represents the number of items rejected daily by a manufacturer because of defects,
recorded for the last 25 days. Construct a histogram.
SOLUTION
Step 1: Draw and label the x and y axes.
Step 2: Represent the frequency on the y axis and the class boundaries on the x axis.
Step 3: Using the frequencies as the heights, draw vertical bars for each class.
[Figure: Histogram. x-axis: items rejected (class boundaries 5.5-10.5, 10.5-15.5, 15.5-20.5, 20.5-25.5, 25.5-30.5); y-axis: frequency.]
The frequency polygon is a graph that displays the data by using lines that connect points plotted
for the frequencies at the midpoints of the classes. The frequencies are represented by the heights
of the points.
SOLUTION
Step 2: Draw the x and y axes. Label the x axis with the midpoints of each class and the y axis for the
frequencies.
Step 3: Using the midpoints for the x values and the frequencies as the y values, plot the points.
[Figure: Frequency Polygon. x-axis: items rejected (midpoints 3, 8, 13, 18, 23, 28, 33); y-axis: frequency.]
EXAMPLE 2−11
SOLUTION
Step 2: Draw and label the x and y axes. The cumulative frequencies will go on the y-axis and the upper
class boundaries will go on the x-axis.
Step 3: Using the upper class boundaries for the x values and the cumulative frequencies as the y values,
plot the points.
[Figure: Ogive. x-axis: items rejected (class boundaries 5.5, 10.5, 15.5, 20.5, 25.5, 30.5); y-axis: cumulative frequency.]
EXAMPLE 2−12
Construct a histogram, frequency polygon, and an ogive using relative frequencies for the distribution of
the weights of 50 randomly selected ST130 students.
SOLUTION
Step 1: Calculate the class boundaries, class midpoints, relative frequency and cumulative relative
frequency.
The histogram will be drawn using class boundaries on the x-axis and relative frequency on the y-axis.
The frequency polygon will be drawn using midpoints on the x-axis and relative frequency on the y-axis.
The ogive will be drawn using upper class boundaries on the x-axis and cumulative relative frequency
on the y-axis.
[Figure: Frequency Polygon. x-axis: weights (midpoints 24.5, 34.5, 44.5, 54.5, 64.5, 74.5, 84.5); y-axis: relative frequency.]
[Figure: Ogive. x-axis: weights (class boundaries 29.5, 39.5, 49.5, 59.5, 69.5, 79.5); y-axis: cumulative relative frequency.]
Symmetric Frequency Curve: It is approximately identical on both sides of a line running through
the center. This type of distribution is known as bell-shaped
distribution.
Uniform Frequency Curve: If a curve has the same frequency for each class, then it is said to
be a uniform or rectangular curve.
A stem and leaf plot is a data plot that uses part of the data value as the stem and part of the data
value as the leaf to form groups or classes.
EXAMPLE 2−13
At an outpatient-testing center, the number of cardiograms performed each day for 20 days is shown
below. Construct a stem-and-leaf plot for the data.
25 31 20 32 13 14 43 02 57 23
36 32 33 32 44 32 52 44 51 45
SOLUTION
To construct a stem-and-leaf plot for the above data, we follow these steps:
Step 1: Arrange the data in order:
02, 13, 14, 20, 23, 25, 31, 32, 32, 32, 32, 33, 36, 43, 44, 44, 45, 51, 52, 57
Step 2: Separate the data according to the first digit:
02
13, 14
20, 23, 25
31, 32, 32, 32, 32, 33, 36
43, 44, 44, 45
51, 52, 57
Step 3: Using the unit (trailing) digit values as leaves, the corresponding stem-and-leaf plot is shown
below:
Stem    Leaf
0       2
1       3 4
2       0 3 5
3       1 2 2 2 2 3 6
4       3 4 4 5
5       1 2 7
By looking at this stem-and-leaf display, we can observe how the data values are distributed. For
example, the stem 3 has the highest frequency, followed by stems 4, 2, 5, 1, and 0.
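The construction of the stem-and-leaf plot can be sketched in Python: sort the data, then split each value into a leading digit (stem) and a trailing digit (leaf). This is an illustrative sketch that assumes two-digit data, as in this example.

```python
from collections import defaultdict

data = [25, 31, 20, 32, 13, 14, 43, 2, 57, 23,
        36, 32, 33, 32, 44, 32, 52, 44, 51, 45]

stems = defaultdict(list)
for value in sorted(data):           # arrange the data in order
    stem, leaf = divmod(value, 10)   # tens digit is the stem, units digit the leaf
    stems[stem].append(leaf)

print("Stem | Leaf")
for stem in sorted(stems):
    print(f"{stem:>4} | {' '.join(str(leaf) for leaf in stems[stem])}")
```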
2.6 Summary
This chapter focused on the statistical techniques of organizing and presenting data. The data was
organized using a frequency distribution table and presented using various graphs such as bar graphs,
Pareto charts, time series graphs, pie charts, histograms, frequency polygons and ogives. We also learnt to
recognize the shape of frequency distributions and construct stem and leaf plots.
EXERCISES
1. Twenty-five army inductees were given a blood test to determine their blood type. The following data
was obtained:
A B B AB O O O B AB B
B B O A O A O O O AB
AB A O B A
2. The amount of protein (in grams) for a variety of fast food sandwiches is reported here.
23 30 20 27 44 26 35 20 29 29
25 15 18 27 19 22 12 26 34 15
75 66 77 66 64 73 91 65 59 86 61 86 61 58 70
77 80 58 95 78 62 79 83 54 52 45 82 48 67 55
DATA DESCRIPTION
Objectives
After completing this chapter, you should be able to:
1. Describe data, using measures of central tendencies, such as mean, median, mode and
midrange.
2. Describe data, using measures of variations, such as range, variance and standard deviation.
3. Identify the position of a data value in a data set, using various measures of position, such as
standard scores, percentiles, deciles and quartiles.
4. Check for outliers in a data set.
5. Use the techniques of exploratory data analysis, including boxplots to discover the nature of the
data.
3.1 Introduction
In Chapter 2, we saw how one can analyse raw data by organizing it into a frequency
distribution and then presenting the data by using various graphs. Organizing and presenting alone are not
enough to describe data meaningfully, so we will now examine some statistical methods that can be used
to describe the data. The methods include measures of central tendency, measures of variation and
measures of position.
Measures of average, or measures of central tendency, are numerical measures that locate the
center of the dataset. Measures of central tendency include mean, median, mode, midrange and
weighted mean.
Knowing an average such as the mean, median or mode is not enough to describe the dataset entirely,
therefore measures of variation or dispersion are also studied. Measures of variation or dispersion are
numerical measures that determine the spread of data values from the center. Measures of variation
include range, variance, and standard deviation.
In addition to measure of central tendency and measure of variation, there are measures of position or
location. They are used to locate the relative position of the data value in the dataset. Measures of position
include percentiles, deciles and quartiles. These measures are used extensively in psychology and
education and sometimes they are referred to as norms.
The types of measures of central tendency that will be discussed in this section are mean, median, mode,
midrange and weighted mean.
A parameter is a characteristic or measure obtained by using all the data values from an entire
population.
A statistic is a characteristic or measure obtained by using all the data values from a specific sample
chosen from a large population.
General Rounding Rule: When computations are done in statistics, the basic rounding rule is that,
rounding should not be done until the final answer is calculated. If rounding is done in every step along
the way, it tends to increase the difference between that answer and the exact one.
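The symbol $\bar{X}$ represents the sample mean and $\mu$ represents the population mean.
We use the following formulas summarized in the table below to compute the mean:
              Raw data                        Ungrouped frequency distribution    Grouped frequency distribution
Sample        $\bar{X} = \frac{\sum X}{n}$    $\bar{X} = \frac{\sum fX}{n}$       $\bar{X} = \frac{\sum fX_m}{n}$
Population    $\mu = \frac{\sum X}{N}$        $\mu = \frac{\sum fX}{N}$           $\mu = \frac{\sum fX_m}{N}$
Where,
n is the sample size
N is the population size
f is the frequency of a class
$X_m$ is the midpoint of a class interval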
EXAMPLE 3−1
The data given below represent the marks scored by a sample of 11 students selected from a particular
English class. Find the mean mark.
67, 89, 49, 55, 87, 79, 72, 69, 81, 52, 91
SOLUTION
Since the dataset represents a sample and is raw data, the mean is given by:
\( \bar{X} = \frac{\sum X}{n} = \frac{67 + 89 + \cdots + 91}{11} = \frac{791}{11} \approx 71.9 \)
Hence, the mean mark is 71.9.
Rounding Rule for the Mean. The mean should be rounded to one more decimal place than occurs
in the raw data.
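The calculation of the mean mark for the 11 students can also be checked with a short Python sketch. The variable names are illustrative only; the rounding is applied once, at the end, in line with the rounding rules stated above.

```python
# Marks of the 11 students (raw data from the example above).
marks = [67, 89, 49, 55, 87, 79, 72, 69, 81, 52, 91]

total = sum(marks)        # sum of all data values
n = len(marks)            # sample size
sample_mean = total / n   # keep full precision until the end

# Raw data are whole numbers, so report one decimal place.
rounded_mean = round(sample_mean, 1)
print(total, n, rounded_mean)   # prints: 791 11 71.9
```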
EXAMPLE 3−2
Using the frequency distribution as in Example 2-2 of Chapter 2, find the mean.
SOLUTION
Rating( X ) Frequency ( f ) fX
1 2
2 1
3 2
4 2
5 2
6 5
7 3
8 2
9 2
10 3
Total n = 24
Step 3: Find the sum of the values in the 3rd column. The completed table is shown below.
Rating (X)   Frequency (f)   fX
1            2               2
2            1               2
3            2               6
4            2               8
5            2               10
6            5               30
7            3               21
8            2               16
9            2               18
10           3               30
Total        n = 24          \(\sum fX = 143\)
Step 4: Divide the sum of the 3rd column by n to get the mean.
\( \bar{X} = \frac{\sum fX}{n} = \frac{143}{24} \approx 5.96 \)
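The same steps (multiply each rating by its frequency, add the products, divide by n) can be sketched in Python. This is only an illustration of the table method; the list names are assumptions made for the example.

```python
# Ratings and their frequencies, taken from the frequency table above.
ratings = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
freqs = [2, 1, 2, 2, 2, 5, 3, 2, 2, 3]

n = sum(freqs)                                       # total frequency
sum_fx = sum(f * x for x, f in zip(ratings, freqs))  # sum of the fX column
mean_rating = sum_fx / n

print(n, sum_fx, round(mean_rating, 2))   # prints: 24 143 5.96
```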
EXAMPLE 3−3
The following is the distribution of the number of fish caught by all 50 fishermen in a coastal area. Find
the mean number of fish caught by a fisherman.
No. of fish caught   No. of fishermen
11 − 15              12
16 − 20              14
21 − 25              13
26 − 30              11
Total                n = 50
Step 2: Find the midpoint of each class and enter them in the 3rd column.
Step 3: For each class, multiply the frequency by the midpoint and enter the result in the 4th column.
Step 4: Find the sum of the values in the 4th column. The completed table is shown below.
No. of fish caught   No. of fishermen (f)   Midpoint (\(X_m\))   \(fX_m\)
11 − 15              12                     13                   156
16 − 20              14                     18                   252
21 − 25              13                     23                   299
26 − 30              11                     28                   308
Total                n = 50                                      \(\sum fX_m = 1015\)
\( \mu = \frac{\sum fX_m}{N} = \frac{1015}{50} = 20.3 \)
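For grouped data, the midpoint of each class stands in for the values in that class. A minimal Python sketch of the steps above follows; the tuple layout (lower limit, upper limit, frequency) is simply a convenient assumption for the illustration.

```python
# (lower class limit, upper class limit, frequency) for each class.
classes = [(11, 15, 12), (16, 20, 14), (21, 25, 13), (26, 30, 11)]

midpoints = [(low + high) / 2 for low, high, _ in classes]
freqs = [f for _, _, f in classes]

N = sum(freqs)                                          # population size
sum_fxm = sum(f * m for f, m in zip(freqs, midpoints))  # sum of f * X_m
grouped_mean = sum_fxm / N

print(midpoints)               # prints: [13.0, 18.0, 23.0, 28.0]
print(N, sum_fxm, grouped_mean)  # prints: 50 1015.0 20.3
```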
The median is the midpoint of the data set when the data is arranged in order.
EXAMPLE 3−4
The numbers of comics purchased on a particular day by nine school students are given below.
3, 7, 10, 5, 9, 4, 11, 7, 2
Find the median.
SOLUTION
EXAMPLE 3−5
The numbers of tropical cyclones in the Pacific over an 8-year period are as follows.
SOLUTION
The median number of tropical cyclones is \( \frac{687 + 702}{2} = 694.5 \).
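The two cases (an odd and an even number of data values) can be sketched in Python as follows. The function name `median` is an illustrative choice; the first call uses the comics data given earlier, and the second uses only the two middle values quoted in the cyclone solution.

```python
def median(values):
    """Return the midpoint of the data once it is arranged in order."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd number of values: the middle value is the median.
        return ordered[mid]
    # Even number of values: average the two middle values.
    return (ordered[mid - 1] + ordered[mid]) / 2

comics = [3, 7, 10, 5, 9, 4, 11, 7, 2]
median_comics = median(comics)        # sorted: 2 3 4 5 7 7 9 10 11
median_middle = median([687, 702])    # the two middle values quoted above

print(median_comics, median_middle)   # prints: 7 694.5
```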
EXAMPLE 3−6
SOLUTION
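Step 1: Find the class boundaries, cumulative frequency and cumulative percentage for each class.
\( \text{cumulative percentage} = \frac{\text{cumulative frequency}}{\text{total frequency}} \times 100 \)
The table is shown below:
Class boundaries   Frequency   Cumulative frequency   Cumulative percentage
10.5 – 15.5        12          12                     \(\frac{12}{50} \times 100 = 24\)
15.5 – 20.5        14          26                     \(\frac{26}{50} \times 100 = 52\)
20.5 – 25.5        13          39                     \(\frac{39}{50} \times 100 = 78\)
25.5 – 30.5        11          50                     \(\frac{50}{50} \times 100 = 100\)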
Step 2: Using the upper class boundaries for the x values and the cumulative percentage as the y values,
plot the points. This type of ogive is called a Percentile Graph.
[Figure: Percentile Graph. Horizontal axis: no. of fish caught, marked at the class boundaries 10.5, 15.5, 20.5, 25.5 and 30.5. Vertical axis: cumulative percentage, from 0 to 100.]
To estimate the median, find the x−value corresponding to the y-value of 50 from the percentile graph.
So the median is estimated to be 20.
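Reading a value off the percentile graph amounts to linear interpolation between two plotted points. The sketch below assumes the points of the ogive are the class boundaries paired with the cumulative percentages of the fish data (0, 24, 52, 78 and 100, with 0% at the first boundary); the function name is hypothetical.

```python
# Points of the percentile graph: (class boundary, cumulative percentage).
boundaries = [10.5, 15.5, 20.5, 25.5, 30.5]
cum_pct = [0, 24, 52, 78, 100]

def value_at_percentage(target, xs, ys):
    """Interpolate linearly to find the x-value at a cumulative percentage."""
    points = list(zip(xs, ys))
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if y0 <= target <= y1:
            return x0 + (target - y0) * (x1 - x0) / (y1 - y0)
    raise ValueError("target percentage is outside the graph")

median_estimate = value_at_percentage(50, boundaries, cum_pct)
print(round(median_estimate, 1))   # prints: 20.1
```

The interpolated value is close to the estimate of 20 read from the graph; a reading taken by eye will normally be less precise than the arithmetic.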
EXAMPLE 3−7
Find the mode of the transfer fees of 9 professional soccer players for a specific year. The transfer fees, in
millions of dollars, are: 1.2, 12.0, 4.5, 6.1, 8.3, 4.5, 7.2, 11.0, 4.5
SOLUTION
Since $4.5 million occurred 3 times (most often), the mode is $4.5 million.
EXAMPLE 3−8
SOLUTION
A. Since each value occurs only once, there is no mode. (Do not say that the mode is zero).
B. Since both 45 and 55 occur most often (3 times each), the modes are 45 and 55. This set of data
is said to be bimodal.
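The three situations described above (one mode, no mode, two modes) can be illustrated with a small Python sketch. The helper `modes` is a hypothetical name; the first dataset is the transfer-fee data, while the other two are made-up values chosen only to show the no-mode and bimodal cases.

```python
from collections import Counter

def modes(values):
    """Return all values that occur most often; empty list if none repeats."""
    counts = Counter(values)
    top = max(counts.values())
    if top == 1:
        return []          # every value occurs once: there is no mode
    return sorted(v for v, c in counts.items() if c == top)

fees = [1.2, 12.0, 4.5, 6.1, 8.3, 4.5, 7.2, 11.0, 4.5]
fee_modes = modes(fees)

# Hypothetical datasets for the other two cases.
no_mode = modes([12, 15, 19, 23])
two_modes = modes([45, 45, 45, 55, 55, 55, 60])

print(fee_modes, no_mode, two_modes)   # prints: [4.5] [] [45, 55]
```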
EXAMPLE 3−9
SOLUTION
The modal class is 16 – 20, as it has the highest frequency.
Note: In many cases, the measures of central tendency may have significantly different values. One has
to be very cautious in using these measures.
EXAMPLE 3−10
A small company consists of the owner, the manager, a salesperson and two technicians, all of whose
annual salaries are listed below. Find the mean, median and mode.
Here the mean is $20,000, the median is $12,000 and the mode is $9,000. The mean is much higher
than the median and mode because of the extremely high salary of the owner. In such situations, the median
should be used as the measure of central tendency.
EXAMPLE 3−11
SOLUTION
\( MR = \frac{9000 + 50000}{2} = 29{,}500 \)
Hence, the midrange is 29,500. The midrange is affected by the extreme value of $50,000 in the dataset.
Note: In statistics, several measures can be used for an average. The most common measures are the
mean, median, mode and midrange. Each has its own specific purpose and use. The median is a better
measure when there are extreme values in the dataset (see Example 3−10).
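The weighted mean of the data set \(x_1, x_2, \ldots, x_n\) with respective weightings \(w_1, w_2, \ldots, w_n\) is given by
\[ \text{Weighted mean} = \frac{w_1 x_1 + w_2 x_2 + \cdots + w_n x_n}{w_1 + w_2 + \cdots + w_n} = \frac{\sum w_i x_i}{\sum w_i}. \]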
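As a rough illustration of the formula, the sketch below applies it to weights of 20, 10, 10 and 60 (a mid-semester test, two assignments and a final exam, assuming two assignments so that the weights total 100%). The marks are hypothetical and are not taken from this book.

```python
def weighted_mean(values, weights):
    """Sum of w_i * x_i divided by the sum of the weights."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)

weights = [20, 10, 10, 60]   # test, assignment 1, assignment 2, final exam
marks = [70, 80, 90, 60]     # hypothetical marks out of 100

final_mark = weighted_mean(marks, weights)
print(final_mark)   # prints: 67.0
```

Because the final exam carries most of the weight, the result sits closer to the exam mark of 60 than the plain average of the four marks (75) would.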
The mid-semester test had a weight of 20%, assignments had a weight of 10% each and the final exam
had a weight of 60%.
SOLUTION
As in regulation, the weights for the results are in the following ratio:
For awarding the final result, we have to take this weighting into account:
EXAMPLE 3−13
I wish to test two brands of outdoor paint to see how long each will last before fading. The results (in
months) are shown. Find the mean and median of each group. (Assume Population)
Brand A Brand B
10 35
60 45
50 30
30 35
40 40
20 25
The mean and median for both brands of paint are 35 months. Since the mean and median for both brands
of paint are the same, we cannot conclude which paint is better using these measures of central tendency.
The types of measures of variation that will be discussed in this section are range, variance, and standard
deviation.
3.3.1 Range
The range is the simplest measure of variation and is defined as:
The range (R) is the highest value minus the lowest value in the data set. That is,
\( R = \text{highest value} - \text{lowest value} \)
EXAMPLE 3−14
Find the range for the two brands of paints given in Example 3−13.
SOLUTION
Since the range of Brand B is less it can be concluded that Brand B is less variable (more reliable or a
better choice) than Brand A.
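The comparison can be reproduced with a few lines of Python using the paint data from Example 3−13. The function name `data_range` is illustrative.

```python
# Months before fading for each brand (Example 3-13).
brand_a = [10, 60, 50, 30, 40, 20]
brand_b = [35, 45, 30, 35, 40, 25]

def data_range(values):
    """Highest value minus lowest value."""
    return max(values) - min(values)

range_a = data_range(brand_a)
range_b = data_range(brand_b)

print(range_a, range_b)   # prints: 50 20
```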
Since the range is not a good measure of variability when there are extreme values in the dataset, statisticians
use other measures called the variance and standard deviation.
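The corresponding formulas used to calculate these variances of raw data are
\[ \sigma^2 = \frac{\sum (X - \mu)^2}{N} \quad \text{and} \quad s^2 = \frac{\sum (X - \bar{X})^2}{n - 1}, \]
Where,
\[ \mu = \frac{\sum X}{N} \quad \text{and} \quad \bar{X} = \frac{\sum X}{n}. \]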
EXAMPLE 3−15
Find the variance and standard deviation for Brand A paint data given in Example 3−13.
SOLUTION
\( \mu = \frac{\sum X}{N} = \frac{210}{6} = 35 \)
Step 2: Subtract the mean from each data value and square each result. The completed table is shown
below.
Brand A (X)   \((X - \mu)^2\)
10            (10 – 35)² = 625
60            (60 – 35)² = 625
50            225
30            25
40            25
20            225
\( \sum (X - \mu)^2 = 625 + 625 + 225 + 25 + 25 + 225 = 1750 \)
\( \sigma^2 = \frac{\sum (X - \mu)^2}{N} = \frac{1750}{6} \approx 291.7 \)
Remarks:
1. The variance and standard deviation of Brand B paint are 41.7 and 6.5 respectively.
2. Since the standard deviation of Brand B is less, one can conclude that Brand B is less variable (more
reliable or a better choice) than Brand A.
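The definition-based calculation for both brands can be sketched in Python as below. The data are treated as populations, as the example assumes; the function name is illustrative.

```python
from math import sqrt

brand_a = [10, 60, 50, 30, 40, 20]
brand_b = [35, 45, 30, 35, 40, 25]

def population_variance(values):
    """Mean of the squared deviations from the population mean."""
    N = len(values)
    mu = sum(values) / N
    return sum((x - mu) ** 2 for x in values) / N

var_a = population_variance(brand_a)
var_b = population_variance(brand_b)
sd_a, sd_b = sqrt(var_a), sqrt(var_b)

print(round(var_a, 1), round(sd_a, 1))   # prints: 291.7 17.1
print(round(var_b, 1), round(sd_b, 1))   # prints: 41.7 6.5
```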
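Sample
  Raw data: \( s^2 = \frac{\sum X^2 - \frac{(\sum X)^2}{n}}{n - 1} \)
  Frequency distribution: \( s^2 = \frac{\sum fX^2 - \frac{(\sum fX)^2}{n}}{n - 1} \)
  Grouped frequency distribution: \( s^2 = \frac{\sum fX_m^2 - \frac{(\sum fX_m)^2}{n}}{n - 1} \)
Population
  Raw data: \( \sigma^2 = \frac{\sum X^2 - \frac{(\sum X)^2}{N}}{N} \)
  Frequency distribution: \( \sigma^2 = \frac{\sum fX^2 - \frac{(\sum fX)^2}{N}}{N} \)
  Grouped frequency distribution: \( \sigma^2 = \frac{\sum fX_m^2 - \frac{(\sum fX_m)^2}{N}}{N} \)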
Note: Always use the shortcut formulas to compute variance and standard deviation.
EXAMPLE 3−16
Find the variance and standard deviation for Brand A paint data given in Example 3−13 using the shortcut
formula.
SOLUTION
Step 2: Square each data value and enter them in the 2nd column.
Brand A (X)          \(X^2\)
10                   100
60                   3600
50                   2500
30                   900
40                   1600
20                   400
\(\sum X = 210\)     \(\sum X^2 = 9100\)
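With the two column totals in hand, the shortcut formula for a population needs only one further line of arithmetic. A minimal Python sketch, treating Brand A as a population as in Example 3−15:

```python
from math import sqrt

brand_a = [10, 60, 50, 30, 40, 20]

N = len(brand_a)
sum_x = sum(brand_a)                   # total of the X column
sum_x2 = sum(x * x for x in brand_a)   # total of the X^2 column

# Shortcut formula: (sum of squares - (sum)^2 / N) / N
variance = (sum_x2 - sum_x ** 2 / N) / N
std_dev = sqrt(variance)

print(sum_x, sum_x2)                           # prints: 210 9100
print(round(variance, 1), round(std_dev, 1))   # prints: 291.7 17.1
```

The result agrees with the variance of 291.7 obtained from the definition in Example 3−15.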
EXAMPLE 3−17
Find the variance and standard deviation of the number of fish caught using the data in Example 3−3.
SOLUTION
No. of fish caught   No. of fishermen (f)
11 – 15              12
16 – 20              14
21 – 25              13
26 – 30              11
Total                n = 50
Step 2: Find the midpoint of each class and enter them in the 3rd column.
Step 3: For each class, multiply the frequency by the midpoint and enter the result in the 4th column. Find
the sum of the values in the 4th column.
Step 4: For each class, multiply the frequency by the square of the midpoint and enter the result in the
5th column. Find the sum of the values in the 5th column. The completed table is shown below.
No. of fish caught   f        \(X_m\)   \(fX_m\)               \(fX_m^2\)
11 – 15              12       13        156                    2028
16 – 20              14       18        252                    4536
21 – 25              13       23        299                    6877
26 – 30              11       28        308                    8624
Total                n = 50             \(\sum fX_m = 1015\)   \(\sum fX_m^2 = 22065\)
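The column totals can be fed into the grouped shortcut formula as sketched below. Since Example 3−3 describes the data as covering all 50 fishermen, the sketch assumes the population form (divide by N); if the data were instead treated as a sample, the final division would be by n − 1.

```python
from math import sqrt

midpoints = [13, 18, 23, 28]
freqs = [12, 14, 13, 11]

N = sum(freqs)
sum_fxm = sum(f * m for f, m in zip(freqs, midpoints))       # 4th column total
sum_fxm2 = sum(f * m * m for f, m in zip(freqs, midpoints))  # 5th column total

# Population shortcut formula for grouped data (assumed form, see lead-in).
pop_variance = (sum_fxm2 - sum_fxm ** 2 / N) / N
pop_std_dev = sqrt(pop_variance)

print(sum_fxm, sum_fxm2)                              # prints: 1015 22065
print(round(pop_variance, 2), round(pop_std_dev, 1))  # prints: 29.21 5.4
```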
The coefficient of variation, denoted by CV, is the standard deviation divided by the mean. The result
is expressed as a percentage.
For population: \( CV = \frac{\sigma}{\mu} \times 100\% \)
For sample: \( CV = \frac{s}{\bar{X}} \times 100\% \)
EXAMPLE 3−18
The mean of the number of sales of airplane engines over a 6-month period is 92, and the standard
deviation is 5. The mean of the commissions earned is $5255, and the standard deviation is $770.
Compare the variations of the two.
SOLUTION
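One way to work the comparison through is sketched below in Python, using the means and standard deviations given in the example. The function name is illustrative.

```python
def coefficient_of_variation(std_dev, mean):
    """Standard deviation as a percentage of the mean."""
    return std_dev / mean * 100

cv_sales = coefficient_of_variation(5, 92)
cv_commissions = coefficient_of_variation(770, 5255)

print(round(cv_sales, 1), round(cv_commissions, 1))   # prints: 5.4 14.7
```

Since the coefficient of variation is larger for the commissions, the commissions are more variable than the sales, even though the two quantities are measured in different units.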
The types of measures of position that will be discussed in this section are standard scores, percentiles,
deciles and quartiles.
A z score or standard score for a value is obtained by subtracting the mean from the value and dividing
the result by the standard deviation, i.e.
For population: \( z = \frac{X - \mu}{\sigma} \)
For sample: \( z = \frac{X - \bar{X}}{s} \)
EXAMPLE 3−19
A student scored 90 on a Maths test that had a mean of 52 and a standard deviation of 10; she also scored
45 on an English test with a mean of 35 and a standard deviation of 5. Compare her relative positions on
the two tests.
SOLUTION
For Maths: \( z = \frac{X - \bar{X}}{s} = \frac{90 - 52}{10} = 3.8 \)
For English: \( z = \frac{X - \bar{X}}{s} = \frac{45 - 35}{5} = 2.0 \)
Since the z score for the Maths test is higher than the z score for the English test, her relative position
is higher on the Maths test.
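The two standard scores can be checked with a short Python sketch; the function name `z_score` is an illustrative choice.

```python
def z_score(value, mean, std_dev):
    """Number of standard deviations a value lies above or below the mean."""
    return (value - mean) / std_dev

z_maths = z_score(90, 52, 10)
z_english = z_score(45, 35, 5)

print(z_maths, z_english)   # prints: 3.8 2.0
```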
3.4.2 Percentiles
Percentiles are position measures used in educational and health-related fields to indicate the position of
an individual in a group.
Percentiles are data values that divide the dataset into 100 equal parts, where the dataset should be in
ascending order. Each set of observations has 99 percentiles, denoted by \(P_1, P_2, \ldots, P_{99}\).
[Figure: the ordered dataset divided by the percentiles into 100 parts, each containing 1% of the values.]
Remarks:
1. \(P_{20}\) is called the 20th percentile, which indicates that 20% of the scores fall below \(P_{20}\).
2. \(P_{50}\) is called the 50th percentile, which indicates that 50% of the scores fall below \(P_{50}\).
3. \(P_{50} =\) median.
Note:
1. To calculate quartiles and deciles of raw data, convert them to percentiles and use the same
steps.
2. To estimate percentiles, deciles and quartiles, use a Percentile Graph.
Percentile Rank
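We can calculate the percentile rank for a particular value x of a data set by using the formula:
\( \text{Percentile rank} = \frac{(\text{number of values below } x) + 0.5}{\text{total number of values}} \times 100\% \)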
[Figure: the ordered dataset divided into 10 equal parts by the deciles \(D_1, D_2, D_3, \ldots, D_7, D_8, D_9\).]
Remarks:
1. \(D_4\) is called the 4th decile, which indicates that 40% of the scores fall below \(D_4\).
2. \(D_5\) is called the 5th decile, which indicates that 50% of the scores fall below \(D_5\).
3. \(P_{50} = D_5 =\) median.
4. \(D_1 = P_{10}\); \(D_2 = P_{20}\); \(D_3 = P_{30}\); …; \(D_9 = P_{90}\).
3.4.4 Quartiles
Quartiles are data values that divide the dataset into 4 equal parts, where the dataset should be in
ascending order. Each set of observations has 3 quartiles, denoted by \(Q_1\), \(Q_2\) and \(Q_3\).
Remarks:
1. \(Q_1\) is called the 1st quartile (or lower quartile), which indicates that 25% of the scores fall below
\(Q_1\).
2. \(Q_3\) is called the 3rd quartile (or upper quartile), which indicates that 75% of the scores fall below
\(Q_3\).
70, 77, 65, 56, 99, 62, 79, 73, 85, 87, 92, 82
SOLUTION
56, 62, 65, 70, 73, 77, 79, 82, 85, 87, 92, 99
Therefore,
\( Q_3 = \frac{85 + 87}{2} = 86. \)
4. Percentile rank of 92 \( = \frac{10 + 0.5}{12} \times 100\% = 87.5. \)
Hence, approximately 87.5% of the scores are below 92 in the given data.
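The values above can be reproduced with the Python sketch below. The position rule used in `percentile` is an assumption: it is a common textbook convention (compute c = n·p/100; if c is a whole number, average the c-th and (c+1)-th ordered values, otherwise round c up), chosen because it agrees with the answers worked out here. Other conventions exist and can give slightly different results.

```python
import math

scores = [70, 77, 65, 56, 99, 62, 79, 73, 85, 87, 92, 82]

def percentile(values, p):
    """Value of the p-th percentile (1 <= p <= 99) under the assumed rule."""
    ordered = sorted(values)
    c = len(ordered) * p / 100
    if c == int(c):
        c = int(c)
        # Whole number: average the c-th and (c+1)-th ordered values.
        return (ordered[c - 1] + ordered[c]) / 2
    # Otherwise round up and take that ordered value.
    return ordered[math.ceil(c) - 1]

def percentile_rank(values, x):
    """(number of values below x + 0.5) / n * 100."""
    below = sum(1 for v in values if v < x)
    return (below + 0.5) / len(values) * 100

q1 = percentile(scores, 25)
q3 = percentile(scores, 75)
rank_92 = percentile_rank(scores, 92)

print(q1, q3, rank_92)   # prints: 67.5 86.0 87.5
```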
EXAMPLE 3−21
1. P20 .
2. Percentile rank for the score 26.
SOLUTION
[Figure: Percentile Graph. Horizontal axis: no. of fish caught, marked at the class boundaries 10.5, 15.5, 20.5, 25.5 and 30.5. Vertical axis: cumulative percentage, from 0 to 100.]
The interquartile range is the difference between the upper quartile and the lower quartile. That
is,
\( \text{Interquartile range (IQR)} = Q_3 - Q_1 \)
The quartile deviation is half of the difference between the upper quartile and the lower
quartile. That is,
\( \text{Quartile deviation (QD)} = \frac{Q_3 - Q_1}{2} \)
EXAMPLE 3−22
Find the interquartile range and the quartile deviation for the given data in Example 3−20.
SOLUTION
\( Q_1 = 67.5 \) and \( Q_3 = 86 \)
Therefore,
Interquartile range \( = Q_3 - Q_1 = 86 - 67.5 = 18.5 \)
and
Quartile deviation \( = \frac{Q_3 - Q_1}{2} = \frac{86 - 67.5}{2} = 9.25 \)
An outlier is an extremely high or an extremely low data value when compared with the rest of the
data values.
EXAMPLE 3−23
SOLUTION
The data value 70 is suspected of being an outlier. Using the procedure given above we have:
In EDA,
Data can be organised using a stem and leaf plot.
The measure of central tendency used is the median.
The measure of variation used is the interquartile range.
Data are represented graphically using a box-plot.
A box-plot is a graph that is used to determine the nature and shape of the distribution in EDA. It is
obtained by drawing a horizontal line from the minimum data value to Q1 , drawing a horizontal line from
Q3 to the maximum data value, and drawing a box whose vertical sides pass through Q1 and Q3 with
a vertical line inside the box passing through the median.
EXAMPLE 3−24
SOLUTION
Step 1: The Five-Number Summary (Note: The data should be arranged in ascending order first)
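1. The lowest value is 3;
2. \(Q_1 = 8\);
3. The median is 11.5;
4. \(Q_3 = 16\);
5. The highest value is 20.
[Figure: box-plot drawn above a scale running from 0 to 22, with lines from 3 to 8 and from 16 to 20, a box from 8 to 16, and a vertical line inside the box at the median, 11.5.]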
3.7 Summary
This chapter discussed statistical techniques for describing data. The data were described using
measures of central tendency, measures of variation and measures of position. The
measures of central tendency include the mean, median, mode and midrange, which locate the center of the
data set; the measures of variation include the range, variance and standard deviation, which gauge the spread
of data values; and the measures of position include the standard score, percentile, decile and quartile, which
locate the position of data values. Further, the chapter explained how to detect outliers in a data set and how
to construct a box-plot.
EXERCISES
2. A survey of all the 110 firms in a small state was carried out to find the number of people employed
at each. The results are shown in the following table.
Number of Employees 1 – 10 11 – 20 21 – 30 31 – 40 41 – 50
Frequency 32 34 14 12 18
3. Suppose an instructor gives two exams and a final exam, assigning the final exam a weight twice
that of each of the other exams. Find the weighted mean for a student who scores 73 and 67 on the
first two exams and 85 on the final exam.
4. An analysis of monthly wages paid to the workers of firm A and B belonging to the same industry
gives the following results:
Firm A Firm B
Number of Workers 100 200
Average monthly wage $196 $185
Variance of distribution of wages $81 $144
PROBABILITY (PART I)
Objectives
After completing this chapter, you should be able to:
1. Find the sample space of probabilistic experiments.
2. Calculate the probability using classical and empirical approach.
3. Calculate the probability using the addition rule.
4.1 Introduction
This section introduces probability: where it can be used and how it is defined. It also outlines the
other concepts that you will learn in this chapter.
What is Probability?
No doubt, you are familiar with terms such as probability, chance and likelihood. They are often used
interchangeably. Statements that involve probability are:
The weather forecaster announces that there is an 80 percent chance of rain in a soccer match.
The probability that a certain brand of computer will survive 100,000 hours of operation without
repair is 0.75.
What are the chances of Fiji winning the IRB series this year?
Probability, which is an important part of statistics, is a number that describes the chance that something
will happen. A more formal definition is:
Probability is the numerical measure of the likelihood that a specific event will occur.
Many people are familiar with probability from observing or playing various games of chance using cards,
coins and dice, or in lotteries. In addition to being used in games of chance, probability theory is often
used for explaining many real-world phenomena and helps us in decision-making in the fields of
insurance, investments, and weather forecasting and in various other areas. Finally, probability theory is
the basis of inferential statistics, which we will discuss in later Chapters in this course.
In this chapter, the basic concepts of probability are explained. These concepts include probability
experiments, sample spaces, outcomes, events and many others. Further, this Chapter also explains the
three basic interpretations of probability, mutually exclusive events and the addition rules of probability.
EXAMPLE 4−1
Experiment                                      Sample space
Tossing two coins or tossing a coin two times   S = {HH, HT, TH, TT}
Tossing a coin and then rolling a die           S = {H1, H2, H3, H4, H5, H6, T1, T2, T3, T4, T5, T6}
SOLUTION
Since each die can land in six different ways, and two dice are rolled, the sample space can be presented
by a rectangular array as follows:
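                 Die 2
Die 1    1        2        3        4        5        6
1        (1, 1)   (1, 2)   (1, 3)   (1, 4)   (1, 5)   (1, 6)
2        (2, 1)   (2, 2)   (2, 3)   (2, 4)   (2, 5)   (2, 6)
3        (3, 1)   (3, 2)   (3, 3)   (3, 4)   (3, 5)   (3, 6)
4        (4, 1)   (4, 2)   (4, 3)   (4, 4)   (4, 5)   (4, 6)
5        (5, 1)   (5, 2)   (5, 3)   (5, 4)   (5, 5)   (5, 6)
6        (6, 1)   (6, 2)   (6, 3)   (6, 4)   (6, 5)   (6, 6)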
EXAMPLE 4−3
Find the sample space for drawing one card from an ordinary deck of cards.
SOLUTION
A tree diagram is a device consisting of line segments emanating from a starting point and also from
the outcome point. It is used to determine the sample space in a systematic way.
EXAMPLE 4−4
Use a tree diagram to find the sample space for a family of three children.
SOLUTION
Since there are two possibilities (boy or a girl) for the first child, draw two branches from a starting point
and label one B and the other G. Then if the first child is a boy, there are two possibilities for the second
child (boy or a girl), so draw two branches from B and label one B and the other G. Do the same if the
first child is a girl. Follow the same procedure for the third child. The completed tree diagram is shown
below. To find the outcomes for the sample space, trace through all possible branches.
1st child   2nd child   3rd child   Outcome
B           B           B           BBB
                        G           BBG
            G           B           BGB
                        G           BGG
G           B           B           GBB
                        G           GBG
            G           B           GGB
                        G           GGG
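Tracing every branch of the tree is the same as listing every ordered choice of B or G for each of the three children. A minimal Python sketch of that idea, using the standard library function `itertools.product`:

```python
from itertools import product

# Each child is either a boy (B) or a girl (G); three children in order.
sample_space = ["".join(outcome) for outcome in product("BG", repeat=3)]

print(len(sample_space))   # prints: 8
print(sample_space)
# prints: ['BBB', 'BBG', 'BGB', 'BGG', 'GBB', 'GBG', 'GGB', 'GGG']
```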
For example, in the experiment of tossing two coins, the sample space is S = {HH, HT, TH, TT}.
We can denote an event E to be getting 2 heads, that is E = {HH}, or an event F to be getting no heads, that
is F = {TT}.
A simple event is an event with only one sample point.
A compound event is an event with more than one sample point.
An event which does not contain any sample point is called an impossible event (or null event
or empty event). It is denoted by \(\emptyset\).
An event, which contains all the sample points of the sample space, is called sure (or certain)
event.
EXAMPLE 4−5
In an experiment of throwing a die, classify the events below as simple, compound, sure or impossible
event.
A. Getting a six
B. Getting even faces
C. Getting even or odd faces
D. Getting a seven
SOLUTION
A. Simple
B. Compound
C. Compound and Sure
D. Impossible
A Venn diagram uses circles to represent sets, in which the relations between the sets are indicated
by the arrangement of the circles.
For example, out of forty students, 14 are taking English and 29 are taking chemistry at USP. If five
students are in both classes, the Venn diagram to represent this is:
[Figure: Venn diagram showing an event A and its complement Ā within the sample space S.]
EXAMPLE 4−6
Consider the experiment of rolling a die. The sample space is S = {1, 2, 3, 4, 5, 6}. If A = {1, 3, 5}, then Ā =
{2, 4, 6}.
[Figure: Venn diagram of events A and B within the sample space S.]
EXAMPLE 4−7
[Figure: Venn diagram of events A and B within the sample space S.]
EXAMPLE 4−8
EXAMPLE 4−9
EXAMPLE 4−10
Find the probability of getting a red ace when a card is drawn from an ordinary deck of cards.
SOLUTION
Let R = red ace. Since there are 52 cards and 2 red aces (the ace of hearts and ace of diamonds) in an
ordinary deck of cards, P(R) = 2/52 = 1/26.
EXAMPLE 4−11
If a family has three children, find the probability of the following events:
SOLUTION
Refer to the sample space for a family of three children obtained earlier from the tree diagram. There are 8
outcomes in the sample space.
EXAMPLE 4−12
Two dice are rolled. Find the probability of the following events:
A. E: The sum of faces is equal to 7.
B. F: The sum of faces is greater than 7.
C. G: The sum of faces is 7 or 11.
Refer to the sample space in Example 4-2. The total number of outcomes is 36.
A. There are 6 outcomes in the sample space whose sum is 7. Therefore, P (E) = 6 / 36 = 1 / 6.
B. There are 15 outcomes in the sample space whose sum is greater than 7. Therefore,
P (F) = 15 / 36 = 5 / 12.
C. There are 8 outcomes in the sample space whose sum is 7 or 11. Therefore, P (G) = 8 / 36 = 2 / 9.
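These classical probabilities can be verified by listing all 36 outcomes and counting the favourable ones. The sketch below is one way to do this in Python; the helper name `probability` is an illustrative choice, and `Fraction` is used so that the answers appear as exact fractions.

```python
from fractions import Fraction
from itertools import product

# All ordered pairs (die 1, die 2): 36 equally likely outcomes.
sample_space = list(product(range(1, 7), repeat=2))

def probability(event):
    """Favourable outcomes divided by total outcomes (classical approach)."""
    favourable = [o for o in sample_space if event(o)]
    return Fraction(len(favourable), len(sample_space))

p_e = probability(lambda o: sum(o) == 7)         # sum equal to 7
p_f = probability(lambda o: sum(o) > 7)          # sum greater than 7
p_g = probability(lambda o: sum(o) in (7, 11))   # sum is 7 or 11

print(p_e, p_f, p_g)   # prints: 1/6 5/12 2/9
```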
EXAMPLE 4−13
A marble is drawn from a bag containing 3 white, 2 red and 5 blue marbles. What is the probability that
the marble drawn is:
A. green,
B. white,
C. not white, and
D. White or red.
SOLUTION
A. P(green) = 0/10 = 0.
B. P(white) = 3/10.
C. P(not white) = 1 − P(white) = 1 − 3/10 = 7/10.
D. P(white or red) = 5/10.
EXAMPLE 4−14
In a sample of 50 people, 21 had type O blood, 22 had type A blood, 5 had type B blood, and 2 had type AB
blood. Construct a frequency distribution and find the probability that:
A. A person has type A blood.
B. A person has type A or type B blood.
C. A person has neither type A nor type O blood.
D. A person does not have type O blood.
SOLUTION
EXAMPLE 4−15
A computer supplies store is concerned that it may be over-stocking printers. The store has tabulated the
number of printers sold weekly for each of the past 80 weeks. The results are summarized in the following
table:
SOLUTION
A.
No. of printers sold 0 1 2 3 4
Probability 36/80 28/80 12/80 2/80 2/80
B. Empirical
C. 4/80 = 1/20
EXAMPLE 4−16
Determine which events are mutually exclusive and which are not when a single die is rolled.
SOLUTION
A. The first event has outcomes 1, 3, 5 and the second event has outcomes 2, 4, 6, therefore the
events are mutually exclusive since there is no outcome in common.
B. The first event has outcome 4 and the second event has outcomes 2, 4, 6, therefore the events
are not mutually exclusive since 4 is common in both events.
EXAMPLE 4−17
Determine which events are mutually exclusive and which are not when a single card is drawn from a
deck.
A. Getting a 3 and getting a 6.
B. Getting a 3 and getting a diamond.
C. Getting a red card and getting an ace.
SOLUTION
Addition Rule
If A and B are any two events, then the probability of the occurrence of either event A or event B is
1. P (A or B) = P (A) + P (B), when A and B are mutually exclusive.
2. P (A or B) = P (A) + P (B) − P (A ∩ B), when A and B are not mutually exclusive.
Note:
The above rules can be extended to more than two events.
EXAMPLE 4−18
In a class, there are 20 Fijian, 13 Samoan, and 6 Tongan students. If a student is selected at random,
find the probability that he/she is either a Fijian or Tongan student.
SOLUTION
EXAMPLE 4−19
A single card is drawn from a deck. Find the probability that it is a spade or an ace.
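Both forms of the addition rule can be sketched with one small Python function: when the events are mutually exclusive, the probability of both occurring is simply zero. The function name `p_union` is hypothetical. The first call uses the class of 20 Fijian, 13 Samoan and 6 Tongan students; the second uses the spade-or-ace question, where the ace of spades is counted in both events.

```python
from fractions import Fraction

def p_union(p_a, p_b, p_both=Fraction(0)):
    """P(A or B) = P(A) + P(B) - P(A and B)."""
    return p_a + p_b - p_both

# Mutually exclusive: a student cannot be both Fijian and Tongan here.
p_fijian_or_tongan = p_union(Fraction(20, 39), Fraction(6, 39))

# Not mutually exclusive: the ace of spades is both a spade and an ace.
p_spade_or_ace = p_union(Fraction(13, 52), Fraction(4, 52), Fraction(1, 52))

print(p_fijian_or_tongan, p_spade_or_ace)   # prints: 2/3 4/13
```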
EXAMPLE 4−20
A McDonald’s customer is selected at random. The probability that he has tried a Big Mac is 0.5, that he has
tried a Soft Cone is 0.6, and that he has tried both a Big Mac and a Soft Cone is 0.2. Find the following probabilities:
SOLUTION
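[Figure: Venn diagram of events B (Big Mac) and S (Soft Cone); the region outside both circles is labelled 0.1.]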
A. 0.9
B. 0.4
C. 0.1
D. 0.5
4.5 Summary
In this chapter, we looked at the basic concepts of probability. It explained terms and concepts
such as probability; experiments; probabilistic and non-probabilistic experiments; sample space;
outcome; tree diagram and Venn diagram; event; simple, compound, null and sure events; complement,
intersection and union of events. Later, it discussed the three interpretations of probability, namely
classical, empirical and subjective probability, and the addition rules of probability.
1. A coin is tossed; if it falls head up, it is tossed again. If it falls tail up, a die is rolled. Draw a tree
diagram and determine all possible outcomes.
4. In USP, the probability that a student takes calculus or is on scholarship is 0.85. The probability that
a student is on scholarship is 0.61 and the probability that a student is taking calculus is 0.31.
A. Are events C: student takes calculus, and S: student is on scholarship mutually exclusive events?
Explain.
B. If a student is randomly chosen, find the probability that the student is taking calculus and is on
scholarship.
C. If a student is randomly chosen, find the probability that the student is neither taking calculus nor
is on scholarship.
5. For a card drawn from an ordinary deck, find the probability of getting a:
A. Queen
B. 3 and a diamond
C. 3 or a diamond
D. 3 or a 6
6. In a hospital unit, there are 8 nurses and 5 doctors; 7 nurses and 3 doctors are females. If a staff member is
selected at random, find the probability that the staff member:
A. Is a female
B. Is a nurse and a female
C. Is a nurse or a female
7. Tom and Jerry roll two dice 50 times and record the sum of the two dice in the table below.
Objectives
After completing this chapter, you should be able to:
1. Find the probability of compound events using multiplication rules.
2. Find the conditional probability of an event.
3. Utilize the fundamental counting rule, permutation and combination.
4. Find the probability of an event using the counting rules.
5.1 Introduction
In the previous chapter, we have looked at some basic concepts of probability. Further, we also explained
the three basic interpretations of probability, the concepts of mutually exclusive events and the addition
rules.
The purpose of this chapter is to look at some more concepts of probability such as independent events,
dependent events, conditional probability and counting rules.
Two events A and B are independent events if the fact that A occurs does not affect the
probability of B occurring.
EXAMPLE 5−1
To test for independence of two events, we can use the following rule:
P (A ∩ B) = P (A) × P (B).
EXAMPLE 5−2
A coin is flipped and a die is rolled. Find the probability of getting a head on the coin and a 6 on the die.
SOLUTION
Let A = getting a head on the coin and B = getting a 6 on the die. The events A and B are independent,
therefore
P (A ∩ B) = P (A) × P (B)
= 1/2 × 1/6 = 1/12
EXAMPLE 5−3
In a group of 60 students, 20 study History, 24 study French and 8 study both History and French. Are
the events a student studies History and a student studies French independent?
SOLUTION
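\( P(\text{History}) = \frac{20}{60} = \frac{1}{3}, \quad P(\text{French}) = \frac{24}{60} = \frac{2}{5}, \quad P(\text{History and French}) = \frac{8}{60} = \frac{2}{15}. \)
Now,
\( P(\text{History}) \times P(\text{French}) = \frac{1}{3} \times \frac{2}{5} = \frac{2}{15}. \)
Since P (History and French) = P (History) × P (French), the two events are independent.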
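The test for independence is a direct comparison of P(A ∩ B) with P(A) × P(B). A short Python sketch using the counts from this example follows; exact fractions are used so that the comparison is not affected by rounding.

```python
from fractions import Fraction

total = 60
p_history = Fraction(20, total)
p_french = Fraction(24, total)
p_both = Fraction(8, total)

# Independent exactly when P(A and B) equals P(A) * P(B).
independent = p_both == p_history * p_french

print(p_history * p_french, p_both, independent)   # prints: 2/15 2/15 True
```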
EXAMPLE 5−4
At USP 74.3% of the incoming first year students have computers. If 2 students are selected at random,
find the probabilities.
SOLUTION
Let C = student has a computer and N = the student does not have computer.
The tree diagram for this problem is as follows:
1st student   2nd student   Outcome
C (0.743)     C (0.743)     CC
              N (0.257)     CN
N (0.257)     C (0.743)     NC
              N (0.257)     NN
Note: The above rule for independent events can be extended to more than two events. That is, if A, B and C
are independent events, then P (A ∩ B ∩ C) = P (A) × P (B) × P (C).
\( P(B \mid A) = \frac{P(A \cap B)}{P(A)}, \) provided \( P(A) \neq 0. \)
EXAMPLE 5−5
In a certain city, the probability that an automobile will be stolen and found within one week is 0.0009.
The probability that an automobile will be stolen is 0.0015. Find the probability that a stolen automobile
will be found within one week.
SOLUTION
\( P(B \mid A) = \frac{P(A \cap B)}{P(A)} = \frac{0.0009}{0.0015} = 0.6. \)
EXAMPLE 5−6
A random sample of 200 adults is classified below according to gender and level of education attained.
                     Gender
Education     Male   Female   Total
Elementary    38     45       83
Secondary     28     50       78
College       22     17       39
Total         88     112      200
If a person is picked at random from this group, find the probability that:
A. The person is male, given that the person has secondary education.
B. The person does not have a college degree, given that the person is a female.
SOLUTION
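A. \( P(\text{male} \mid \text{secondary}) = \frac{P(\text{male} \cap \text{secondary})}{P(\text{secondary})} = \frac{28/200}{78/200} = \frac{28}{78} \approx 0.36 \)
B. \( P(\text{no college degree} \mid \text{female}) = \frac{P(\text{no college degree} \cap \text{female})}{P(\text{female})} = \frac{(45 + 50)/200}{112/200} = \frac{95}{112} \approx 0.85 \)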
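With a two-way table, a conditional probability reduces to restricting attention to the rows or columns named in the condition. The sketch below stores the table as a dictionary keyed by (education, gender); this layout and the helper name `conditional` are assumptions made for the illustration.

```python
from fractions import Fraction

# Counts from the two-way table, keyed by (education, gender).
counts = {
    ("Elementary", "Male"): 38, ("Elementary", "Female"): 45,
    ("Secondary", "Male"): 28, ("Secondary", "Female"): 50,
    ("College", "Male"): 22, ("College", "Female"): 17,
}

def conditional(event, given):
    """P(event | given) = count(event and given) / count(given)."""
    n_given = sum(c for k, c in counts.items() if given(k))
    n_both = sum(c for k, c in counts.items() if given(k) and event(k))
    return Fraction(n_both, n_given)

p_male_given_secondary = conditional(
    lambda k: k[1] == "Male", lambda k: k[0] == "Secondary")
p_no_college_given_female = conditional(
    lambda k: k[0] != "College", lambda k: k[1] == "Female")

print(round(float(p_male_given_secondary), 2))      # prints: 0.36
print(round(float(p_no_college_given_female), 2))   # prints: 0.85
```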
EXAMPLE 5−7
EXAMPLE 5−8
A company estimates that 30% of the country has seen its commercial and that if a person sees its
commercial, there is 20% probability that the person will buy its products. What is the probability that a
person chosen at random in the country has seen the commercial and bought the product?
SOLUTION
Let A = the person sees the commercial and B = the person buys the product. Therefore,
\( P(A \cap B) = P(A) \times P(B \mid A) = 0.30 \times 0.20 = 0.06. \)
EXAMPLE 5−9
A flashlight has 6 batteries, 2 of which are defective. If 2 are selected at random without replacement,
find the probability that:
A. Both are defective.
B. None are defective.
C. At least one is defective.
SOLUTION
Let D = the battery is defective and G = the battery is good. The tree diagram for this problem is
1st battery   2nd battery   Outcome
D (2/6)       D (1/5)       DD
              G (4/5)       DG
G (4/6)       D (2/5)       GD
              G (3/5)       GG
Note: The second branch has conditional probabilities, that is 1/5 is the probability that the second battery
is defective given that the first battery was defective. Similarly, 3/5 is the probability that the second
battery is good given that the first battery was good.
Using the tree diagram,
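The probability of each path is found by multiplying the probabilities along its branches. A minimal Python sketch of the three parts, using exact fractions (the variable names are illustrative):

```python
from fractions import Fraction

# Branch probabilities for drawing 2 batteries without replacement
# from 6 batteries, 2 of which are defective (D) and 4 good (G).
p_both_defective = Fraction(2, 6) * Fraction(1, 5)   # path D then D
p_none_defective = Fraction(4, 6) * Fraction(3, 5)   # path G then G

# "At least one defective" is the complement of "none defective".
p_at_least_one = 1 - p_none_defective

print(p_both_defective, p_none_defective, p_at_least_one)
# prints: 1/15 2/5 3/5
```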
EXAMPLE 5−10
Three cards are drawn from an ordinary deck without replacement. Find the probability of these.
A. Getting 3 jacks.
B. Getting an ace, a king, and a queen in order.
C. At least one jack.
SOLUTION
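A. \( P(\text{3 jacks}) = \frac{4}{52} \times \frac{3}{51} \times \frac{2}{50} = \frac{1}{5525}. \)
B. \( P(\text{an ace, a king and then a queen}) = \frac{4}{52} \times \frac{4}{51} \times \frac{4}{50} = \frac{8}{16{,}575}. \)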
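The products above, together with part C, can be checked with the short Python sketch below. Part C is handled through the complement (no jack on any of the three draws); this is one standard approach and is offered here as an illustration rather than as the book's own working.

```python
from fractions import Fraction

# Draws are made without replacement, so the deck shrinks each time.
p_three_jacks = Fraction(4, 52) * Fraction(3, 51) * Fraction(2, 50)
p_ace_king_queen = Fraction(4, 52) * Fraction(4, 51) * Fraction(4, 50)

# At least one jack = 1 - P(no jack on any of the three draws).
p_no_jack = Fraction(48, 52) * Fraction(47, 51) * Fraction(46, 50)
p_at_least_one_jack = 1 - p_no_jack

print(p_three_jacks, p_ace_king_queen, p_at_least_one_jack)
# prints: 1/5525 8/16575 1201/5525
```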
EXAMPLE 5−11
If a coin is tossed:
two times, then the total number of outcomes = 2 × 2 = \(2^2\) = 4.
three times, then the total number of outcomes = 2 × 2 × 2 = \(2^3\) = 8.
r times, then the total number of outcomes = \(2^r\).
EXAMPLE 5−12
If a die is rolled:
two times, then the total number of outcomes = 6 × 6 = \(6^2\) = 36.
r times, then the total number of outcomes = \(6^r\).
EXAMPLE 5−13
How many different license plate numbers can be made using two letters followed by three digits, if letters
and digits may be repeated?
SOLUTION
Since there are 26 letters (A, B, C, …, X, Y, Z) and 10 digits (0, 1, 2, …, 9) that can be used to form a
license plate number, the total number of license plate numbers possible is
26 × 26 × 10 × 10 × 10 = 676,000.
EXAMPLE 5−14
The chairs in a room are to be labelled with a vowel and a positive integer not exceeding 99. What
is the largest number of chairs that can be labelled differently?
SOLUTION
Since there are 5 vowels (A, E, I, O, U) and 99 positive integers not exceeding 99 (1, 2, …, 99) that can be used to
label a chair, the largest number of chairs that can be labelled differently is
5 × 99 = 495.
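The fundamental counting rule multiplies the number of choices available at each stage. The Python sketch below applies it to the license plate and chair examples, and then confirms the chair count by listing every label; the function name `count_outcomes` is an illustrative choice.

```python
from itertools import product
from math import prod

def count_outcomes(*choices_per_stage):
    """Fundamental counting rule: multiply the choices at each stage."""
    return prod(choices_per_stage)

plates = count_outcomes(26, 26, 10, 10, 10)   # 2 letters then 3 digits
chairs = count_outcomes(5, 99)                # 1 vowel then 1 integer

# Cross-check the chair labels by listing them all.
labels = [f"{v}{k}" for v, k in product("AEIOU", range(1, 100))]

print(plates, chairs, len(labels))   # prints: 676000 495 495
```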
Factorial Notation
Before discussing permutations, we introduce a useful shorthand notation: the factorial symbol. The
symbol n!, read as “n factorial,” is defined as:
n! = n(n − 1)(n − 2) × … × 3 × 2 × 1
Where,
0! = 1
1! = 1.
For example, 5! can be written as 5 × 4 × 3 × 2 × 1 = 120. (Factorials can be computed directly using a
calculator.)
EXAMPLE 5−15
Note: P(n, n) = n!
EXAMPLE 5−16
How many ways are there to select a first-prize winner, a second-prize winner and a third-prize winner
from 50 different students who have entered a mathematics contest?
EXAMPLE 5−17
In how many different ways can a chairperson and an assistant chairperson be selected for a research
project if there are seven scientists available?
EXAMPLE 5−18
How many 3-digit numbers can be formed from the digits 1, 2, 3, 4, 5, 6, 7?
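Questions of this kind count ordered selections, so they can be explored with the standard library function `math.perm(n, r)`, which returns n!/(n − r)!. The sketch below covers the prize-winner and chairperson questions; for the 3-digit numbers it assumes that a digit may not be repeated, which the question does not state explicitly.

```python
from math import factorial, perm

prize_orders = perm(50, 3)    # 1st, 2nd and 3rd prize from 50 students
chair_pairs = perm(7, 2)      # chairperson and assistant from 7 scientists
three_digit = perm(7, 3)      # assumes no digit is used twice

print(prize_orders, chair_pairs, three_digit)   # prints: 117600 42 210

# P(n, n) = n!, as noted earlier.
print(perm(5, 5) == factorial(5))   # prints: True
```

If digits were allowed to repeat, the fundamental counting rule would apply instead, giving 7 × 7 × 7 = 343.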
EXAMPLE 5−19
In how many distinct ways can the letters in the word "STATISTICS" be arranged?
SOLUTION
Since there are 3 S's, 3 T's and 2 I's, the number of distinct ways the letters can be arranged is
\( \frac{10!}{3!\,3!\,2!} = 50{,}400. \)
EXAMPLE 5−20
How many different vertical arrangements are possible for 10 flags if 2 are white, 3 are red and 5 are
blue?
SOLUTION
Since there are 2 white, 3 red and 5 blue flags, the number of different vertical arrangements possible is
\( \frac{10!}{2!\,3!\,5!} = 2520. \)
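When some items are alike, the total n! is divided by the factorial of each group of identical items. A minimal Python sketch of this rule, applied to the word STATISTICS and to the flags, is given below; the function name is illustrative and the flags are represented by the letters W, R and B.

```python
from collections import Counter
from math import factorial

def distinct_arrangements(items):
    """n! divided by the factorial of the count of each repeated item."""
    result = factorial(len(items))
    for count in Counter(items).values():
        result //= factorial(count)
    return result

word_ways = distinct_arrangements("STATISTICS")             # 3 S, 3 T, 2 I
flag_ways = distinct_arrangements("W" * 2 + "R" * 3 + "B" * 5)

print(word_ways, flag_ways)   # prints: 50400 2520
```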
\( C(n, r) = {}^{n}C_r = \frac{n!}{(n - r)!\, r!}. \)
EXAMPLE 5−21
How many ways are there to select six players from a 15-member volleyball team for a challenge match
against another department?
EXAMPLE 5−22
How many different ways can a lecturer select two textbooks from a possible of 17?
EXAMPLE 5−23
SOLUTION
A. Since there are 12 people and 4 are to be selected for the committee, there are C (12, 4) =
495 ways.
B. There are C (5, 2) ways of choosing 2 men and C (7, 2) ways of choosing 2 women, hence there are
C (7, 2) × C (5, 2) = 210 ways.
C. At least 2 women on the committee means 2 women or 3 women or 4 women on the committee.
There are C (7, 2) × C (5, 2) = 210 ways to have 2 women, C (7, 3) × C (5, 1) = 175 ways to have
3 women, and C (7, 4) × C (5, 0) = 35 ways to have 4 women, hence there are 210 + 175 + 35 = 420
ways.
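The counts in this solution can be reproduced with `math.comb(n, r)`, the standard library function for C(n, r). The group sizes used below (7 women and 5 men, with a committee of 4) are inferred from the solution itself.

```python
from math import comb

women, men, size = 7, 5, 4   # inferred from the solution above

any_committee = comb(women + men, size)        # part A
two_and_two = comb(women, 2) * comb(men, 2)    # part B

# Part C: add the cases with 2, 3 or 4 women on the committee.
at_least_two_women = sum(
    comb(women, w) * comb(men, size - w) for w in range(2, size + 1)
)

print(any_committee, two_and_two, at_least_two_women)   # prints: 495 210 420
```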
EXAMPLE 5−24
Find the probability that if 10 different-sized books are arranged in a row, they will be arranged in order
of size.
SOLUTION
EXAMPLE 5−25
Five cards are drawn from a pack of 52 cards. What is the probability that:
A. All are spades,
B. 2 are hearts and 3 are diamonds, and
C. All are black.
SOLUTION
A pack of cards contains 52 cards out of which 13 are spades, 13 are hearts, 13 are diamonds and 13
are clubs. If 5 cards are drawn, then:
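A. \( P(\text{all are spades}) = \frac{{}^{13}C_5}{{}^{52}C_5} = \frac{1287}{2598960} = \frac{33}{66640} \approx 0.0005. \)
B. \( P(\text{2 hearts and 3 diamonds}) = \frac{{}^{13}C_2 \times {}^{13}C_3}{{}^{52}C_5} = \frac{78 \times 286}{2598960} \)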
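Each part of this example divides the number of favourable five-card hands by the total number of five-card hands, C(52, 5). The Python sketch below works through all three parts; parts B and C are computed in the same way as part A and are offered only as an illustration.

```python
from fractions import Fraction
from math import comb

total_hands = comb(52, 5)   # number of ways to draw 5 cards from 52

p_all_spades = Fraction(comb(13, 5), total_hands)
p_2_hearts_3_diamonds = Fraction(comb(13, 2) * comb(13, 3), total_hands)
p_all_black = Fraction(comb(26, 5), total_hands)   # 26 black cards

print(total_hands)             # prints: 2598960
print(p_all_spades)            # prints: 33/66640
print(p_2_hearts_3_diamonds)   # prints: 143/16660
print(p_all_black)             # prints: 253/9996
```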
SOLUTION
EXAMPLE 5−27
What is the probability that a four-digit telephone extension has one or more repeated digits?
SOLUTION
5.6 Summary
In this chapter, we discussed more advanced concepts in probability, such as independent and
dependent events and conditional probability. We then discussed counting rules, namely the
fundamental counting rule, permutations and combinations, and used them to solve some probability problems.
1. In a scientific study, there are 8 guinea pigs, 5 of which are pregnant. If 3 are to be selected at random
to be used in the experiment, find the probability that:
A. All three are pregnant.
B. Exactly 2 are pregnant.
C. At least one is pregnant.
2. Approximately 10% of the students at USP own a car. If 3 students are selected at random, find the
probability that:
A. All of them own a car.
B. Exactly 2 own a car.
C. At least one owns a car.
3. The following table gives the two-way classification of 400 students based on gender and whether or
not they work while being full-time students.
A. A student is randomly selected from this group of 400 students. What is the probability that this
student:
i. works
ii. works or is male
iii. is female and does not work
iv. does not work, given that the student is male
B. Are the events “male” and “do not work” mutually exclusive events? Explain why or why not.
C. Are the events “female” and “do not work” independent? Explain why or why not.
4. Urn 1 contains 5 red marbles and 3 black marbles. Urn 2 contains 3 red marbles and 1 black marble.
If an urn is selected at random and a marble is drawn, find the probability it will be black.
5. Two cards are drawn at random (without replacement) from a regular deck of 52 cards.
A. What is the probability that the first card is red and the second card is a heart?
B. What is the probability that the first card is a heart and the second card is red?
6. There are 2 roads between town A and B. There are 4 roads between town B and C. How many
different routes may one travel from town A to town C through town B?
9. A committee of 5 people is to be formed from 6 doctors and 9 dentists. Find the probability that the
committee will consist of:
A. All dentists
B. 2 dentists and 3 doctors
10. What is the probability that a seven-digit phone number contains the number 7?
DISCRETE PROBABILITY
DISTRIBUTIONS
Objectives
After completing this chapter, you should be able to:
1. Construct a probability distribution for a discrete random variable.
2. Find the mean, variance, standard deviation and expected value for a discrete random variable.
3. Find probabilities using binomial distribution.
4. Find mean, variance, standard deviation for the variable of binomial distribution.
6.1 Introduction
In the last chapter, we discussed the concepts and rules of probability. This chapter extends the concept
of probability to explain probability distributions. We have seen that a random experiment has more than
one outcome and that it is impossible to predict which of the many possible outcomes will occur if the
experiment is performed. In this chapter, we will see that if the outcomes and their probabilities for a
random experiment are known, we can find out what will happen, on average, if the random experiment
is performed many times.
This chapter explains random variables and their types. Then the concept of a probability
distribution, and its mean and variance for a discrete random variable, are discussed. In addition, a special
probability distribution called the binomial distribution is explained.
EXAMPLE 6−1
If we count the number of heads in each outcome of the sample space S = {HH, HT, TH, TT}, we have

Outcome   HH   HT   TH   TT
X          2    1    1    0
Variables that assume values that are countable are called discrete variables. For example, the number
of students in a class, number of road accidents, etc.
Variables that can assume all values in an interval are called continuous variables. For example, the
weight of a student in a class, the price of a car, etc.
EXAMPLE 6−2
In an experiment of rolling a single die, write the probability distribution of the number of dots.
SOLUTION
Let X be the number of dots on the die, and then the values X can assume are 1, 2, 3, 4, 5, 6. The
probability of each outcome is 1/6. Then the probability distribution of X is given by:
X       1     2     3     4     5     6
P(X)   1/6   1/6   1/6   1/6   1/6   1/6
EXAMPLE 6−3
In an experiment of tossing a coin 3 times, write the probability distribution of the number of heads.
SOLUTION
The sample space is S = {HHH, HHT, HTH, THH, HTT, THT, TTH, TTT}.
Let X be the number of heads, then the values X can assume are 0, 1, 2, and 3. Then the probability
distribution of X is given by
X       0     1     2     3
P(X)   1/8   3/8   3/8   1/8
EXAMPLE 6−4
In an experiment of rolling two dice, find the probability distribution of a random variable X that represents
the sum of outcomes.
SOLUTION
Refer to Example 4−2 of Chapter 4 for the sample space. When we sum the outcomes, the minimum
sum we get is 2 and the maximum we can get is 12. The values X can assume are 2, 3, 4, 5, 6, 7, 8, 9,
10, 11, 12.
To find the probability corresponding to 2, we have to find which outcomes sum to 2. There is only one
such outcome, (1, 1). Since there are 36 outcomes altogether, P(2) = 1/36.
To find the probability corresponding to 3, we have to find which outcomes sum to 3. There are two
such outcomes, (1, 2) and (2, 1), so P(3) = 2/36, and so on.
X        2     3     4     5     6     7     8     9    10    11    12
P(X)   1/36  2/36  3/36  4/36  5/36  6/36  5/36  4/36  3/36  2/36  1/36
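The distribution of the sum of two dice can also be built by listing all 36 outcomes and counting. The following is a small sketch of that enumeration; the variable names are illustrative.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# All 36 equally likely outcomes of rolling two dice
outcomes = list(product(range(1, 7), repeat=2))
counts = Counter(a + b for a, b in outcomes)

# P(X = s) = (number of outcomes whose sum is s) / 36
distribution = {s: Fraction(c, len(outcomes)) for s, c in sorted(counts.items())}

for s in distribution:
    print(s, f"{counts[s]}/36")

# The probabilities of a valid distribution add up to 1
assert sum(distribution.values()) == 1
```

Exact fractions are used so that the check on the total probability is not affected by rounding.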
Two balls are drawn in succession without replacement from an urn containing 4 red balls and 3 black
balls. Find the probability distribution of the random variable X, the number of black balls.
SOLUTION
Selecting 2 balls from 7 can be done in ${}^{7}C_2 = 21$ ways. Hence, S contains 21 sample points.
Here, X = the number of black balls = 0, 1, 2.
The probability of selecting 0 black balls (i.e. X = 0) is
$P(X = 0) = \frac{{}^{3}C_0 \times {}^{4}C_2}{{}^{7}C_2} = \frac{6}{21} = \frac{2}{7}.$
Similarly,
$P(X = 1) = \frac{{}^{3}C_1 \times {}^{4}C_1}{{}^{7}C_2} = \frac{12}{21} = \frac{4}{7},$
$P(X = 2) = \frac{{}^{3}C_2 \times {}^{4}C_0}{{}^{7}C_2} = \frac{3}{21} = \frac{1}{7}.$
The probability distribution of X is given by:
X       0     1     2
P(X)   2/7   4/7   1/7
1. The probability of each event in the sample space must be between 0 and 1 inclusive. That is,
$0 \le P(X) \le 1$.
2. The sum of the probabilities of all the events in the sample space must equal 1; that is,
$\sum P(X) = 1$.
EXAMPLE 6−6
B.
X 0 1 2 3
P( X ) 0.08 0.11 0.39 0.27
SOLUTION
$\sigma^2 = \sum \left[ X^2 \cdot P(X) \right] - \mu^2.$
Note:
1. $\sum \left[ X^2 \cdot P(X) \right]$ means to multiply the square of each value of the random variable by its corresponding
probability, and then add the results.
2. The standard deviation is found by taking the square root of the variance.
EXAMPLE 6−7
Find the mean, variance and standard deviation of the probability distribution in Example 6–3.
SOLUTION
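X       0     1     2     3
P(X)   1/8   3/8   3/8   1/8

$\mu = \sum X \cdot P(X) = 0(1/8) + 1(3/8) + 2(3/8) + 3(1/8) = 1.5.$
$\sigma^2 = \sum \left[ X^2 \cdot P(X) \right] - \mu^2 = \left[ 0^2(1/8) + 1^2(3/8) + 2^2(3/8) + 3^2(1/8) \right] - 1.5^2 = 0.75.$
$\sigma = \sqrt{0.75} = 0.866.$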
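The calculation in this example can be reproduced with a short function. This is a minimal sketch; the function name `mean_var_sd` and the dictionary representation of the distribution are illustrative assumptions, not notation from the text.

```python
from fractions import Fraction
from math import sqrt

def mean_var_sd(dist):
    """`dist` maps each value x of the random variable to P(X = x)."""
    mu = sum(x * p for x, p in dist.items())
    var = sum(x**2 * p for x, p in dist.items()) - mu**2
    return mu, var, sqrt(var)

# Number of heads in three coin tosses
heads = {0: Fraction(1, 8), 1: Fraction(3, 8), 2: Fraction(3, 8), 3: Fraction(1, 8)}
mu, var, sd = mean_var_sd(heads)
print(mu, var, round(sd, 3))  # 3/2 3/4 0.866
```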
EXAMPLE 6−8
In a gambling game, a man is paid $5 if he gets all heads or all tails when 3 coins are tossed but he has
to pay out $3 if either 1 or 2 heads show up. What is his expected gain?
SOLUTION
The sample space is given by S = {HHH, HHT, HTH, THH, HTT, THT, TTH, TTT}. Let X = the gain in the
game. Then the probability distribution of X is given by:
X       $5    −$3
P(X)   2/8    6/8
One thousand tickets are sold at $1 each for a color television valued at $350. What is the expected value
of the gain if a person purchases one ticket?
SOLUTION
Let X = the gain in the game. Then the probability distribution of X is given by:
X       $349     −$1
P(X)   1/1000   999/1000
Hence, the person may lose $0.65, on average, in each try in the game.
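The expected gain is the sum of each gain multiplied by its probability. The sketch below applies this to the coin game of Example 6−8 and to the television raffle; the function name `expected_value` is an illustrative choice.

```python
def expected_value(dist):
    """`dist` maps each gain (in dollars) to its probability."""
    return sum(x * p for x, p in dist.items())

coin_game = {5: 2 / 8, -3: 6 / 8}          # win $5 or pay $3
raffle = {349: 1 / 1000, -1: 999 / 1000}   # win the TV or lose the $1 ticket

print(round(expected_value(coin_game), 2))
print(round(expected_value(raffle), 2))
```

A negative expected value means that the player loses money on average.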
Consider the experiment consisting of tossing a coin three times. Determine whether or not it is a binomial
experiment.
SOLUTION
$P(X) = {}^{n}C_{X} \cdot p^{X} \cdot q^{n - X}.$
Where,
n: Number of trials
p: Probability of success in a trial
q: Probability of failure in a trial
X: Number of successes in n trials
Note:
1. $p + q = 1$.
2. $X = 0, 1, 2, \ldots, n$.
3. The binomial distribution is a discrete distribution.
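The binomial probability formula translates directly into code. The following is a minimal sketch; `binomial_pmf` is an illustrative name, and the two calls use values of n, p and X chosen to match simple coin-tossing and guessing situations.

```python
from math import comb

def binomial_pmf(x, n, p):
    """P(X = x) for n independent trials, each with success probability p."""
    q = 1 - p
    return comb(n, x) * p**x * q**(n - x)

# Exactly 2 heads in 3 tosses of a fair coin
print(round(binomial_pmf(2, 3, 0.5), 4))  # 0.375
# Exactly 3 correct guesses out of 5 questions with 5 choices each (p = 1/5)
print(round(binomial_pmf(3, 5, 0.2), 4))  # 0.0512
```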
EXAMPLE 6−11
A coin is tossed three times. Find the probability of getting exactly two heads.
SOLUTION
If a student randomly guesses at five multiple-choice questions, find the probability that the student gets
exactly three correct. Each question has five possible choices.
SOLUTION
EXAMPLE 6−13
A survey from Teenage Research Unlimited found that 30% of teenage consumers receive their spending
money from part-time jobs. If five are selected at random, find the probability that at least three of them
will have a part-time job.
SOLUTION
Let X be the number of consumers having a part-time job. In this case: n = 5, X = 3, 4, or 5, p = 0.3, and q
= 0.7. Therefore,
$P(3) = {}^{5}C_3 (0.3)^3 (0.7)^2 = 0.132$
$P(4) = {}^{5}C_4 (0.3)^4 (0.7)^1 = 0.028$
$P(5) = {}^{5}C_5 (0.3)^5 (0.7)^0 = 0.002$
Hence,
P(X ≥ 3) = P(3) + P(4) + P(5)
= 0.132 + 0.028 + 0.002 = 0.162.
The above example indicates that the binomial probability formula can be tedious at times. Therefore,
binomial tables have been developed for selected values of n and p to overcome this tiresome task.
Please refer to the Eton statistical tables.
EXAMPLE 6−14
If 30% of the people in a community use the library in one year, for a sample of 15 people find
the probability that:
A. Exactly 7 used the library.
B. At least 5 used the library.
Using the binomial formula, especially in part B, would be very time consuming, so we make use of the
binomial tables.
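As an alternative to the tables, the two probabilities can be computed by summing binomial terms. This is a sketch under the assumptions n = 15 and p = 0.3 taken from the example; the function name is illustrative, and the results may differ from table values in the last decimal place because of rounding.

```python
from math import comb

def binomial_pmf(x, n, p):
    """P(X = x) for a binomial random variable."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

n, p = 15, 0.3

p_exactly_7 = binomial_pmf(7, n, p)
# Complement rule: P(X >= 5) = 1 - P(X <= 4)
p_at_least_5 = 1 - sum(binomial_pmf(x, n, p) for x in range(5))

print(round(p_exactly_7, 4))
print(round(p_at_least_5, 4))
```

Using the complement needs only five terms instead of the eleven terms from X = 5 to X = 15.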
EXAMPLE 6−15
The probability that a patient will die whilst having a particular type of heart operation is 0.40. If 10 patients
decided to have this particular type of heart operation, what is the probability that:
A. 2 will die,
B. At most 3 will die,
C. At least 5 will die
SOLUTION
It is found that 75% of the patients suffering from a particular disease are cured successfully. What is the
probability that 3 of the next 4 patients will be cured successfully?
SOLUTION
We can’t use the tables straightaway since p = 0.75 is not in the tables. To use the table, we have
to change X to Y using Y = n − X and p to $p_1$ using $p_1 = 1 - p$. Therefore, look up the table with n = 4,
Y = 1 and $p_1 = 0.25$. We get P(X = 3) = P(Y = 1) = 0.4219.
Mean: $\mu = np$
Variance: $\sigma^2 = npq$
Standard Deviation: $\sigma = \sqrt{npq}$
EXAMPLE 6−17
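A coin is tossed 4 times. Find the mean, variance, and standard deviation of the number of heads that will
be obtained.
SOLUTION
Here n = 4, p = 1/2, and q = 1/2, and using the formulas, we have
$\mu = n \cdot p = 4 \times (1/2) = 2$
$\sigma^2 = n \cdot p \cdot q = 4 \times (1/2) \times (1/2) = 1$
$\sigma = \sqrt{1} = 1.$
The same values are obtained from the probability distribution of X:
X       0      1      2      3      4
P(X)   1/16   4/16   6/16   4/16   1/16
$\mu = \sum_x x \cdot P(X = x) = 0 \cdot \frac{1}{16} + 1 \cdot \frac{4}{16} + \cdots + 4 \cdot \frac{1}{16} = 2.$
$\sigma^2 = \sum_x x^2 \cdot P(X = x) - \mu^2 = 0^2 \cdot \frac{1}{16} + 1^2 \cdot \frac{4}{16} + \cdots + 4^2 \cdot \frac{1}{16} - 2^2 = 1$
and $\sigma = \sqrt{1} = 1$.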
6.6 Summary
In this chapter, we examined random variables and discrete probability distributions. The concepts
discussed in this chapter were: random variables; discrete probability distributions; and the mean, variance and
standard deviation of a discrete probability distribution. We also discussed a common discrete
distribution, the binomial distribution, and used it to solve some probability problems.
1. The probability density function of a discrete random variable Y is given by $P(Y = y) = cy^2$ for y = 0,
1, 2, 3, 4. Given that c is a constant, find the value of c.
2. The following is the probability distribution of X, the number of breakdowns per week for a machine,
based on past data.
X 0 1 2 3
P (X) 0.15 0.20 0.35 0.3
Find the probability that the no. of breakdowns for this machine during a given week is:
A. Exactly 2
B. At least 2
C. At most 1
3. Find the mean, variance and standard deviation of the probability distribution in question 2.
4. According to an internet posting, 80% of adults enjoy drinking beer. Three adults are randomly
selected. Let X be the number of adults who enjoy drinking beer:
A. Obtain the probability distribution of X.
B. Calculate the expected value and standard deviation of X.
5. Joe is playing a game of chance at the Hibiscus festival, costing $1 for each game. In the game two
fair dice are rolled and the sum of the numbers that turn up is found. If the sum is seven, then Joe
wins $5 otherwise, Joe loses his money. Joe plays the game 15 times. Find his expected gain or loss.
6. Eight people applied for a job as assistant manager of restaurants. Five have completed college and
three have not. If a manager selects three applicants at random, construct a probability distribution
for selecting those that have completed college.
7. A shoe store’s records show that 30% of customers making a purchase use a credit card to make
payment. This morning, 20 customers purchased shoes from the store. Find the probability that at
least 2 of the customers used a credit card. (Assume independence).
8. The editor of a journal historically accepts 11 % of articles submitted for publication. Using the
binomial formula, find the probability that in a random sample of 8 articles submitted to this journal,
the editor will accept:
C. Exactly 4 for publication.
D. At least one for publication.
9. If 3% of calculators are defective, find the mean, variance and standard deviation of a lot of 400
calculators.
10. A fisherman finds that approximately 17% of all his fish go bad by the time he takes them to the
market. The fisherman catches 1,000 fish.
A. How many will go bad by the time he takes them to the market?
B. Find the standard deviation.
Objectives
After completing this chapter, you should be able to:
1. List the properties of a normal distribution.
2. Find the area under the standard normal distribution given the z – values.
3. Find the probabilities for a normally distributed random variable.
4. Find specific data values for given percentage, using standard normal distribution.
5. Use the central limit theorem to solve problems involving sample means for large sample.
7.1 Introduction
Random variables can be either discrete or continuous. Discrete random variables and their distributions
were discussed in Chapter 6, where we also examined the binomial distribution and its properties. Recall
that discrete random variables are those whose values are countable; a continuous random
variable, on the other hand, can assume all values in an interval. Examples of continuous variables are heights of students,
body temperature of dogs and blood pressure of adults. For instance, if the life of a bulb is at most 1 year,
its lifetime can assume any value in the interval from 0 to 1 year. This interval contains an infinite number of
values that are uncountable.
Many continuous random variables have distributions that are bell-shaped and are called approximately
normally distributed variables. In this chapter, we will study a special continuous distribution called the
normal distribution. Finally, this chapter also explains a very important result related to the normal distribution,
called the central limit theorem.
No variable fits a normal distribution perfectly, since a normal distribution is a theoretical distribution.
However, a normal distribution can be used to describe many variables that are approximately normal.
When the data values are evenly distributed about the mean, the distribution is said to be symmetric
(the normal distribution is symmetric). When the majority of the data lies to the left or right of the
mean, the distribution is said to be skewed.
A normally distributed variable X can be transformed into the standard normally distributed variable z
by using the formula for the z-score:
$z = \frac{X - \mu}{\sigma},$
Where,
X = data value
μ = population mean
σ = population standard deviation
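The z-score formula and the areas read from a standard normal table can be sketched in code. This is a minimal illustration, not the method used in the text: the names `z_score` and `standard_normal_cdf` are assumptions, and the area is computed from the error function `math.erf` rather than looked up in the Eton tables.

```python
from math import erf, sqrt

def z_score(x, mu, sigma):
    """Convert a data value x to a standard normal z value."""
    return (x - mu) / sigma

def standard_normal_cdf(z):
    """Area under the standard normal curve to the left of z."""
    return 0.5 * (1 + erf(z / sqrt(2)))

# A data value of 2.3 when the mean is 3.0 and the standard deviation is 0.5
print(round(z_score(2.3, 3.0, 0.5), 2))           # -1.4
# Area to the left of z = 1.99
print(round(standard_normal_cdf(1.99), 4))        # 0.9767
# Table-style area between 0 and z = 1.99
print(round(standard_normal_cdf(1.99) - 0.5, 4))  # 0.4767
```

Subtracting 0.5 from the left-tail area gives the area between 0 and z, which is the quantity tabulated in tables of this style.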
EXAMPLE 7−1
SOLUTION
Step 1: Draw a standard normal curve and shade the area on the left of 1.99.
Step 2: Look for z = 1.99 in the Eton table and we get 0.4767. The area 0.4767 obtained from the table
is the area under the curve from 0 to 1.99. Since the area on the left of 0 is 0.5, the area desired is 0.5 +
0.4767 = 0.9767. The area on the left of 1.99 can also be written as P (z <1.99) = 0.9767 and is read as
probability that z is less than 1.99 is 0.9767 or 97.67%.
SOLUTION
Step 1: Draw a standard normal curve and shade the area on the right of −1.16.
Step 2: Since z = −1.16 is not in the Eton table, look for z = 1.16 and we get 0.3830. The area 0.3830
obtained from the table is the area under the curve from −1.16 to 0. Since the area on the right of 0 is
0.5, the area desired is 0.5 + 0.3830 = 0.8830. P (z > −1.16) = 0.8830.
EXAMPLE 7−3
SOLUTION
Step 1: Draw a standard normal curve and shade the area between z = −1.37 and z = 1.68.
Step 2: Look for z = 1.37 and we get 0.4147. The area 0.4147 obtained from the table is the area under
the curve from −1.37 to 0. Then look for z = 1.68 and we get 0.4535. The area 0.4535 obtained from the
table is the area under the curve from 0 to 1.68. So the area desired is 0.4147 + 0.4535 = 0.8682.
P (−1.37 < z < 1.68) = 0.8682.
SOLUTION
Step 1: Draw a standard normal curve and shade the area on the right of z = 1.91.
Step 2: Look for z = 1.91 and we get 0.4719. The area 0.4719 obtained from the table is the area under
the curve from 0 to 1.91. Since the area on the right of 0 is 0.5, the area desired is 0.5 − 0.4719 =
0.0281. P(z > 1.91) = 0.0281.
EXAMPLE 7−5
Find the z value such that the area under the standard normal curve between 0 and the z value is 0.2157.
SOLUTION
Step 1: Draw a standard normal curve and shade the area between 0 and the z value to be 0.2157.
Step 2: Since the area between 0 and the z value is 0.2157, then look for 0.2157 in the probability
section of the table. The z value corresponding to 0.2157 is 0.57. Therefore, the z value is 0.57. See the
diagram below.
Find the z value such that the area under the standard normal curve on the right of the z value is 0.0239.
SOLUTION
Step 1: Draw a standard normal curve and shade the area on the right of the z value to be 0.0239.
Step 2: Find the area between 0 and the z value, which will be 0.5 − 0.0239 = 0.4761. Then look for
0.4761 in the probability section of the table. The z value corresponding to 0.4761 is 1.98. Therefore,
the z value is 1.98.
To solve the application problems, we need to know how to find the probability given the z value or find
z value given the probability.
The average annual salary for all U.S teachers is $47750. Assume that the distribution is normally
distributed and the standard deviation is $5680. Find the probability that a randomly selected teacher
earns
A. Between $35000 and $45000 a year.
B. More than $40000 a year.
SOLUTION
Let, X = annual salary of a teacher which is normally distributed with µ = 47750 and σ = 5680.
A. This probability can be written as P(35000 < X < 45000). There are two X values here, 35000 and
45000. Now convert the two X values into z using the formula:
For X = 35000, $z = \frac{35000 - 47750}{5680} = -2.24.$
For X = 45000, $z = \frac{45000 - 47750}{5680} = -0.48.$
So P(35000 < X < 45000) = P(−2.24 < z < −0.48). Now draw a standard normal curve and shade
the area between −2.24 and −0.48.
Look for z = 2.24 in the tables and we get 0.4875. The area 0.4875 obtained from the table is the
area under the curve from −2.24 to 0. Now look for z = 0.48 and we get 0.1844. The area 0.1844
obtained from the table is the area under the curve from −0.48 to 0. So the area desired is 0.4875 –
0.1844 = 0.3031. Therefore, P(35000 < X < 45000) = 0.3031 or 30.31%.
B. This probability can be written as P(X > 40000). Now convert the X value into z using the formula:
For X = 40000, $z = \frac{40000 - 47750}{5680} = -1.36.$
So P(X > 40000) = P(z > −1.36). Now draw a standard normal curve and shade the area on the
right of −1.36.
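Both parts of this example can be checked with `statistics.NormalDist` from the Python standard library. This is a sketch under the example's assumptions (mean 47750, standard deviation 5680); because the z values are not rounded to two decimal places, the results can differ slightly from the table-based answers.

```python
from statistics import NormalDist

salary = NormalDist(mu=47750, sigma=5680)

# A. P(35000 < X < 45000) as a difference of left-tail areas
p_between = salary.cdf(45000) - salary.cdf(35000)
# B. P(X > 40000) as the complement of the left-tail area
p_more_than_40000 = 1 - salary.cdf(40000)

print(round(p_between, 4))
print(round(p_more_than_40000, 4))
```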
EXAMPLE 7−8
A certain type of storage battery lasts, on the average, 3.0 years with a standard deviation of 0.5 years.
Assuming that the battery lives are normally distributed, find the probability that a given battery will last
less than 2.3 years.
SOLUTION
Let X = the number of years a battery lasts, which is normally distributed with μ = 3.0 and σ = 0.5. This
probability can be written as P(X < 2.3). Now convert the X value into z using the formula:
For X = 2.3, $z = \frac{2.3 - 3}{0.5} = -1.4.$
So P(X < 2.3) = P(z < −1.4). Now draw a standard normal curve and shade the area on the left of −1.4.
Look for z = 1.4 and we get 0.4192. So the area desired is 0.5 – 0.4192 = 0.0808. Therefore, the
probability that a given battery will last less than 2.3 years is 0.0808 or 8.08%.
EXAMPLE 7−9
An electrical firm manufactures light bulbs that have a length of life that is normally distributed with mean
equal to 800 hours and a standard deviation of 40 hours. Find the probability that a bulb burns between
778 and 834 hours.
SOLUTION
Let X = length of life of a bulb, which is normally distributed with μ = 800 and σ = 40. This probability
can be written as P(778 < X < 834). Converting the X values into z we get −0.55 and 0.85.
So P(778 < X < 834) = P(−0.55 < z < 0.85). The area desired is:
The time taken by the milkman to deliver to the High Street is normally distributed with a mean of 12
minutes and a standard deviation of 2 minutes. He delivers milk every day. Estimate the number of days
during the year when he takes:
A. Longer than 17 minutes.
B. Less than 10 minutes.
SOLUTION
Let X be the time, in minutes, taken to deliver milk to the High Street, which is normally distributed with
μ = 12 and σ = 2.
A. We have to find P(X > 17). Converting 17 into a z value we get 2.5. So P(X > 17) = P(z > 2.5). The
area desired is:
P(X > 17) = 0.5 – 0.4938 = 0.0062. To find the number of days, multiply by 365.
365 × 0.0062 = 2.263 ≈ 2. Therefore, on two days in a year he takes longer than 17 minutes.
B. We have to find P(X < 10). Converting 10 into a z value we get −1. So P(X < 10) = P(z < −1). The
area desired is:
P(X < 10) = 0.5 – 0.3413 = 0.1587. Now 365 × 0.1587 = 57.93 ≈ 58.
Therefore, on 58 days in a year he takes less than 10 minutes.
An IQ test is normally distributed with mean of 400 and standard deviation of 100. The top 3% of students
receive $500 as the prize money. What is the minimum score one would need to receive this award?
SOLUTION
Step 1: Draw a standard normal curve and shade the area on the right of the z value to be 0.03.
Step 2: Find the area between 0 and the z value, which will be 0.5 − 0.03 = 0.47. Then look for 0.47 in
the probability section of the table. We don’t have 0.47, so use the closest value, which is 0.4699. The z
value corresponding to 0.4699 is 1.88. Now use the z formula to find the value of X:
$1.88 = \frac{X - 400}{100}.$
Making X the subject, we get X = 588. Thus, anyone scoring 588 or more qualifies for the award.
EXAMPLE 7−12
For a medical study, a researcher wishes to select people in the middle 60% of the population based on
blood pressure. If the systolic blood pressure is normally distributed with the mean of 120 and the
standard deviation is 8, find the upper and lower readings that would qualify people to participate in the
study.
SOLUTION
Step 1: Draw a standard normal curve and shade the middle area to be 60%.
$\pm 0.84 = \frac{X - 120}{8},$
Therefore, the two values of X are 113.28 and 126.72. Thus, the lower reading is 113.28 and upper
reading is 126.72.
EXAMPLE 7−13
The weights of boxes of oranges are normally distributed such that 30% of them are greater than 4kg
and 20% are greater than 4.53kg. Estimate the mean and standard deviation of the weights.
SOLUTION
We are given that P(X > 4) = 0.3 and P(X > 4.53) = 0.2. Using this we have to find the values of μ and
σ. Let's first consider P(X > 4) = 0.3. Converting 4 into a z value we get $\frac{4 - \mu}{\sigma}$, so we have
$P(X > 4) = P\left(z > \frac{4 - \mu}{\sigma}\right) = 0.3.$ Now draw a standard normal curve and shade the area on the right
of $\frac{4 - \mu}{\sigma}$ to be 0.3.
Using the tables we obtain the z value to be 0.52. Therefore, we have the equation
$0.52 = \frac{4 - \mu}{\sigma}.$ (1)
Similarly, using P(X > 4.53) = 0.2 we get another equation
$0.84 = \frac{4.53 - \mu}{\sigma}.$ (2)
Solving the equations (1) and (2) simultaneously we get µ = 3.12 kg and σ = 1.68 kg.
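The two simultaneous equations can also be solved numerically. The sketch below is an illustration rather than the text's table-based method: it obtains the z values from `statistics.NormalDist().inv_cdf` instead of the tables, so the results may differ from the values above in the last decimal place.

```python
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, standard deviation 1

# P(X > 4) = 0.3 and P(X > 4.53) = 0.2 correspond to left-tail areas 0.7 and 0.8
z1 = std_normal.inv_cdf(0.7)
z2 = std_normal.inv_cdf(0.8)

# z1 = (4 - mu) / sigma and z2 = (4.53 - mu) / sigma.
# Subtracting the first equation from the second eliminates mu.
sigma = (4.53 - 4) / (z2 - z1)
mu = 4 - z1 * sigma

print(round(mu, 2), round(sigma, 2))
```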
A sampling distribution of sample means is a distribution using the means computed from
all possible random samples of a specific size taken from a population.
If the samples are randomly selected with replacement, the sample means will be somewhat different
from the population mean. These differences are caused by sampling error.
Sampling error is the difference between the sample measure and the corresponding
population measure due to the fact that the sample is not a perfect representation of the
population.
The following example illustrates these two properties. Suppose a lecturer gave an 8-point quiz to a small
class of four students. The results of the quiz were 2, 6, 4, and 8. For the sake of discussion, assume
that the four students constitute the population. The mean of the population is µ = 5 and the standard
deviation of the population σ = 2.236.
Now, if all samples of size 2 are taken with replacement and the mean of each sample is found, the
distribution is as shown.
Using the table above, find the mean of the values in the 2nd and the 4th column, which gives $\mu_{\bar{X}} = 5$. This
is the same as the population mean, hence $\mu_{\bar{X}} = \mu$.
To find the standard deviation of the sample means, find the standard deviation of the values in the 2nd
and the 4th column, which gives $\sigma_{\bar{X}} = 1.581$. This is the same as the population standard deviation divided by $\sqrt{2}$.
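The quiz example can be reproduced by listing every sample of size 2 drawn with replacement and computing its mean. This is a minimal sketch using the standard library; the variable names are illustrative.

```python
from itertools import product
from statistics import mean, pstdev

population = [2, 6, 4, 8]

# All 16 ordered samples of size 2 drawn with replacement
sample_means = [mean(sample) for sample in product(population, repeat=2)]

# Mean of the sample means equals the population mean
print(float(mean(sample_means)))              # 5.0
# Standard deviation of the sample means
print(round(pstdev(sample_means), 3))         # 1.581
# Population standard deviation divided by the square root of the sample size
print(round(pstdev(population) / 2**0.5, 3))  # 1.581
```

`pstdev` is used rather than `stdev` because the four scores and the 16 sample means are each treated as a complete population.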
The third property of the sampling distribution of sample means is on the shape of the distribution and is
explained by the central limit theorem.
$z = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}}.$
EXAMPLE 7−14
The average teacher’s salary in Fiji is $29,863. Suppose that the distribution is normal with standard
deviation of $5100.
A. What is the probability that a randomly selected teacher’s salary is less than $40,000?
B. What is the probability that the mean for a sample of 80 teacher’s salary is greater than $30,000?
Let X be the salary of a teacher, which is normally distributed with μ = 29863 and σ = 5100.
A. We have to find P(X < 40,000). Converting 40000 into a z value we get 1.99. So P(X < 40,000) = P(z
< 1.99). The area desired is:
[Figure: standard normal curve with 0 and z = 1.99 marked]
B. We have to find $P(\bar{X} > 30{,}000)$. Since the variable X is normally distributed, the sample mean
$\bar{X}$ will have a normal distribution.
[Figure: standard normal curve with 0 and z = 0.24 marked]
EXAMPLE 7−15
It is reported that children between 2 and 5 years old watch an average of 25 hours of TV per week.
Assume the variable is normally distributed and the standard deviation is 3 hours. If 20 children between
the ages of 2 and 5 are randomly selected, find the probability that the mean of the number of hours
they watch TV will be greater than 26.3 hours.
SOLUTION
We have to find $P(\bar{X} > 26.3)$. Since the variable X is normally distributed, the sample mean $\bar{X}$ will
have a normal distribution.
Therefore, the probability that the mean of the number of hours they watch TV will be greater than 26.3
hours is 0.0262.
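A probability for a sample mean uses the standard error σ/√n in place of σ. The sketch below applies this to the example above; the function name `prob_sample_mean_greater` is an illustrative assumption, and the result may differ slightly from 0.0262 because the z value is not rounded before the area is computed.

```python
from math import sqrt
from statistics import NormalDist

def prob_sample_mean_greater(x_bar, mu, sigma, n):
    """P(sample mean > x_bar), using the standard error sigma / sqrt(n)."""
    return 1 - NormalDist(mu, sigma / sqrt(n)).cdf(x_bar)

# 20 children, mean 25 hours, standard deviation 3 hours, threshold 26.3 hours
print(round(prob_sample_mean_greater(26.3, 25, 3, 20), 4))
```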
EXAMPLE 7−16
The average age of a vehicle registered in Fiji is 8 years, or 96 months. Assume the standard deviation
is 16 months. If a random sample of 36 vehicles is selected, find the probability that the mean of their age
is between 90 and 100 months.
SOLUTION
We have to find $P(90 < \bar{X} < 100)$. The variable X is not normally distributed, but since the sample size
is more than 30, the sample mean $\bar{X}$ will have an approximately normal distribution.
The z value of $\bar{X} = 90$ is
$z = \frac{90 - 96}{16 / \sqrt{36}} = -2.25.$
The z value of $\bar{X} = 100$ is
$z = \frac{100 - 96}{16 / \sqrt{36}} = 1.5.$
Hence, $P(90 < \bar{X} < 100) = P(-2.25 < z < 1.5) = 0.4878 + 0.4332 = 0.921.$
Therefore, the probability that the mean of their age is between 90 and 100 months is 0.921.
EXERCISES
3. In a test, the average score of the 385 students is 65, with a standard deviation of 10. Assume the
scores are normally distributed:
A. What percentage of students scored between 47 and 67?
B. What percentage of students scored 86 or greater?
4. Daily sales of petrol from the Nabua service station are normally distributed with mean 6300L and
the standard deviation 400L.
A. If a daily sale is selected at random, find the probability that it is less than 6200L.
B. If petrol sales are sampled for 40 days, and the mean is calculated, find the probability that the
sample mean is less than 6200L.
5. The marks of the ST130 exam are known to be normally distributed, with a mean of 51 and a standard deviation
of 14. If 200 students take the test,
A. How many would you expect to score between 58 and 65?
B. If 5% of the students get an A+, what is the minimum mark for an A+?
6. The weights of boxes of oranges are normally distributed such that 30% of them are greater than 4kg
and 20% are greater than 4.53kg. Estimate the mean and the standard deviation of the weights.
7. It is reported that children between 2 and 5 years old watch an average of 25 hours of TV per week.
Assume the variable is normally distributed and the standard deviation is 3 hours. If 20 children
between the ages of 2 and 5 are randomly selected, find the probability that the mean of the number
of hours they watch TV will be greater than 26.3 hours.
Objectives
After completing this chapter, you should be able to:
1. Find the confidence interval for the mean when the population standard deviation is known.
2. Determine the minimum sample size for finding a confidence interval of mean.
3. Find the confidence interval for the mean when the population standard deviation is unknown.
4. Find the confidence interval for the population proportion.
5. Determine the minimum sample size for finding a confidence interval of proportion.
8.1 Introduction
As part of inferential statistics, we need to determine the values of population parameters. This is usually not
possible since the population is large, so statisticians have to estimate the value of the parameter. An
important aspect of inferential statistics is estimation, which is the process of estimating the true value
of a population parameter from the information derived from a small sample. For instance, the population
mean (μ) can be estimated using the sample mean ($\bar{X}$).
Therefore, in this chapter, we will explain statistical procedures for estimating the population mean and
proportion. Another important question in estimation is that of sample size. How large should the sample
be in order to make an accurate estimate? This question is not easy to answer as it depends on
several factors, such as the accuracy desired and the probability of making a correct estimate. The
problem of determining the sample size for estimating the parameters will also be discussed in this
chapter.
8.2 Estimation
Estimation is the process of estimating the true value(s) of a population parameter from the information
derived from a small sample.
Confidence Interval
A confidence interval is a specific interval estimate of a parameter determined by using the data obtained
from sample and a specific confidence level.
Confidence Level
A Confidence Level of an interval estimate of a parameter is the probability that the interval estimate will
contain the parameter.
EXAMPLE 8−1
Suppose that a 90% confidence interval states that the population mean is greater than 100 and less
than 200. How would you interpret this statement?
SOLUTION
It means that we are 90% confident that the interval contains the true population mean.
8.3 Confidence Intervals and Sample Size for the Mean when σ is known
Before constructing the confidence interval for μ, it is essential to know the following:
Is the distribution of the population normal or not?
Is the population standard deviation known or unknown?
Is the sample size large or small?
Our answers will then determine how to proceed. In this section we are going to construct the confidence
interval of the population mean when σ is known.
$\bar{X} - z_{\alpha/2} \left( \frac{\sigma}{\sqrt{n}} \right) < \mu < \bar{X} + z_{\alpha/2} \left( \frac{\sigma}{\sqrt{n}} \right).$
Note:
1. If n < 30, the population should be normally distributed.
2. The values of $z_{\alpha/2}$ for some confidence levels are as follows:
For the 99% confidence interval, $z_{\alpha/2}$ = 2.58.
For the 95% confidence interval, $z_{\alpha/2}$ = 1.96.
For the 90% confidence interval, $z_{\alpha/2}$ = 1.65.
However, other values for the confidence level could be given, so how do we find the value of $z_{\alpha/2}$? Let us
consider the next example.
EXAMPLE 8−2
SOLUTION
Draw a standard normal curve and shade the area 0.98 in the middle.
Use the standard normal table from the Eton tables to find the value of $z_{\alpha/2}$. Look up 0.49 in the
probability section and read the corresponding z value. Therefore $z_{\alpha/2}$ = 2.33.
Note: The value of α = 1 − 0.98 = 0.02 and α/2 = 0.02/2 = 0.01. Therefore, the area on the right of
$z_{\alpha/2}$ is 0.01.
A random sample of 49 shoppers showed that they spend an average of $23.45 per visit at MHCC
Bookstore. From past studies, it is known that σ = $2.80.
A. Find a point estimate of the population mean.
B. Find the 99% confidence interval of the true mean.
SOLUTION
EXAMPLE 8−4
Suppose a registrar of the University of the South Pacific (USP) wishes to estimate the average number
of hours per day of distractions (phone calls, emails, impromptu visits, etc.) experienced by USP lecturers.
A study of random sample of 50 lecturers in USP found that the average distraction time is 1.8 hours per
day and the population standard deviation was 20 minutes. Estimate the true mean population distraction
time for USP lecturers with 90% confidence.
SOLUTION
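For the 90% confidence interval, $z_{\alpha/2}$ = 1.65. We have σ = 20 minutes ≈ 0.33 hours, $\bar{X}$ = 1.8 and n = 50, so the 90% confidence
interval for μ is
$$1.8 - 1.65\left(\frac{0.33}{\sqrt{50}}\right) < \mu < 1.8 + 1.65\left(\frac{0.33}{\sqrt{50}}\right)$$
$$1.72 < \mu < 1.88.$$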
Hence one can say with 90% confidence that the average distraction time for a USP lecturer is between
1.72 and 1.88 hours per day, based on 50 lecturers.
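The interval arithmetic above can be checked with a short script. This is a minimal sketch, not part of the original text: the function name `z_interval` is hypothetical, and the critical value 1.65 is passed in as the table value used in this chapter rather than computed.

```python
import math


def z_interval(mean, sigma, n, z):
    """Return (lower, upper) for mean +/- z * sigma / sqrt(n), sigma known."""
    margin = z * sigma / math.sqrt(n)
    return mean - margin, mean + margin


# Example 8-4: 20 minutes is taken as 0.33 hours; z = 1.65 for 90% confidence.
lower, upper = z_interval(mean=1.8, sigma=0.33, n=50, z=1.65)
print(round(lower, 2), round(upper, 2))
```

Changing `z` to 1.96 or 2.58 gives the corresponding 95% or 99% interval for the same data.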
Sample Size
Quite often, researchers need to know how large a sample is necessary to make an accurate estimate.
Sample size matters because an appropriate sample size is required for validity. If the sample is too
small, the results will be questionable. If the sample is too large, money and time are wasted.
The minimum sample size needed to estimate the population mean is:
$$n = \left(\frac{z_{\alpha/2}\,\sigma}{E}\right)^2$$
Where,
E is called the margin of error.
EXAMPLE 8−5
A pizza shop owner wishes to find the 95% confidence interval of the true mean cost of a large plain
pizza. How large should the sample be if she wishes to be accurate to within $0.15? A previous study
showed that the population standard deviation of the price was $0.26.
SOLUTION
$$n = \left(\frac{z_{\alpha/2}\,\sigma}{E}\right)^2 = \left(\frac{(1.96)(0.26)}{0.15}\right)^2 = 11.5.$$
Therefore, the minimum sample size should be 12 to estimate the population mean with 95% confidence.
EXAMPLE 8−6
A researcher in Fiji wishes to estimate within $300 the true average amount of money Fiji spends on road
repairs each year. The standard deviation is known to be $900. If she wants to be 90% confident, how
large a sample is necessary?
SOLUTION
$$n = \left(\frac{(1.65)(900)}{300}\right)^2 = 24.5.$$
Therefore, the minimum sample size should be 25 to estimate the population mean with 90% confidence.
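The two sample-size calculations can be reproduced as follows. This is an illustrative sketch: the function name `sample_size_for_mean` is an assumption, and the result is rounded up to the next whole number, as the examples do.

```python
import math


def sample_size_for_mean(z, sigma, margin):
    """Return (raw value, rounded-up n) for n = (z * sigma / E) ** 2."""
    raw = (z * sigma / margin) ** 2
    return raw, math.ceil(raw)


# Example 8-5: 95% confidence, sigma = 0.26, E = 0.15
raw_pizza, n_pizza = sample_size_for_mean(1.96, 0.26, 0.15)
# Example 8-6: 90% confidence, sigma = 900, E = 300
raw_road, n_road = sample_size_for_mean(1.65, 900, 300)
print(n_pizza, n_road)
```

Rounding up rather than to the nearest integer keeps the margin of error no larger than the requested E.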
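However, when σ is unknown, the t-distribution is used. If n < 30, then the population should be normally
distributed.
If you are still unsure when to use the z or the t distribution, see the diagram below.
[Diagram: Is σ known? Yes: use $z_{\alpha/2}$ values. No: use $t_{\alpha/2}$ values.]
When σ is unknown, the confidence interval for μ is:
$$\bar{X} - t_{\alpha/2}\frac{s}{\sqrt{n}} < \mu < \bar{X} + t_{\alpha/2}\frac{s}{\sqrt{n}}.$$
The values of $t_{\alpha/2}$ can be found from the t-distribution table in the Eton tables.
EXAMPLE 8−7
Find the $t_{\alpha/2}$ value for a 90% confidence interval of the population mean when the sample size is 20.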
SOLUTION
For the 90% confidence interval, α = 0.10, thus α/2 = 0.05. Since n = 20, d.f. = 20 − 1 = 19, so look up the
t-distribution table with ν = 19, 2p = 0.1 and p = 0.05 to get $t_{\alpha/2}$ = 1.729.
EXAMPLE 8−8
For a group of 20 ST130 students subjected to a stress situation, the mean number of heart beats per
minute was 126, and the standard deviation was 4. Find the 95% confidence interval of the true mean.
Assume the variable is normally distributed.
SOLUTION
Since the population standard deviation, σ, is unknown, we use the t-distribution. For the 95%
confidence interval, α = 0.05, α/2 = 0.025 and d.f. = 20 − 1 = 19, so look up the t-distribution table
from the Eton tables with ν = 19, 2p = 0.05 and p = 0.025 to get $t_{\alpha/2}$ = 2.093. Now the 95%
confidence interval is:
$$126 - 2.093\left(\frac{4}{\sqrt{20}}\right) < \mu < 126 + 2.093\left(\frac{4}{\sqrt{20}}\right)$$
$$124.1 < \mu < 127.9.$$
EXAMPLE 8−9
A sample of 10 observations taken from a normal population produced the following data.
44 52 31 48 46 39 47 36 41 56
SOLUTION
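B. Similarly, σ is unknown, so we use the t-distribution with $\bar{X}$ = 44 and s = 7.5. Look up the t-distribution table with ν = 9,
2p = 0.05 and p = 0.025 to get $t_{\alpha/2}$ = 2.262. Hence the 95% confidence interval is:
$$44 - 2.262\left(\frac{7.5}{\sqrt{10}}\right) < \mu < 44 + 2.262\left(\frac{7.5}{\sqrt{10}}\right)$$
$$38.64 < \mu < 49.36.$$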
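The point estimate and the t-based interval for these ten observations can be recomputed from the raw data. The sketch below is illustrative only: it uses Python's standard `statistics` module, takes the critical value 2.262 from the table lookup quoted above, and keeps s unrounded, so the limits differ slightly in the second decimal place from the ones obtained with s = 7.5.

```python
import math
import statistics

data = [44, 52, 31, 48, 46, 39, 47, 36, 41, 56]

n = len(data)
mean = statistics.mean(data)   # point estimate of the population mean
s = statistics.stdev(data)     # sample standard deviation (divides by n - 1)
t_crit = 2.262                 # table value for 9 d.f. and 95% confidence

margin = t_crit * s / math.sqrt(n)
lower, upper = mean - margin, mean + margin
print(mean, round(s, 2), round(lower, 2), round(upper, 2))
```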
The population proportion, denoted by p, is the proportion of population units that possess a
characteristic. The population proportion is given by:
$$p = \frac{X}{N},$$
Where,
X is the number of population units that possess a characteristic
N is the population size
q = 1 –p, is the proportion of population units that do not possess a characteristic
For example, in the USP assessment meeting, the ST130 lecturer stated that 75% of ST130 students
passed the course last semester. The parameter 75% is a population proportion.
The population proportion, p, is often unknown, so a sample proportion, denoted as p̂ (read p hat), is
used to estimate it. It represents the proportion of sample units that possess a characteristic. The sample
proportion is given by:
$$\hat{p} = \frac{x}{n},$$
where x is the number of sample units that possess the characteristic and n is the sample size.
EXAMPLE 8−10
In a study, 400 students were asked whether they owned a computer; 352 said that they did. Find
p̂ and q̂.
SOLUTION
Here n = 400 and x = 352, so
$$\hat{p} = \frac{352}{400} = 0.88 \quad \text{and} \quad \hat{q} = 1 - 0.88 = 0.12.$$
We can say that for this sample 88% of students surveyed owned a computer.
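$$\sigma_{\hat{p}} = \sqrt{\frac{pq}{n}}.$$
This formula is used when n/N ≤ 0.05.
3. By the central limit theorem, the sampling distribution of p̂ is approximately normal for a sufficiently
large sample size (that is, np > 5 and nq > 5), with a mean of p and a standard deviation of
$$\sqrt{\frac{pq}{n}}.$$
4. Therefore, the z-value of p̂ is given by
$$z = \frac{\hat{p} - p}{\sqrt{pq/n}}.$$
The confidence interval for the population proportion p is:
$$\hat{p} - z_{\alpha/2}\sqrt{\frac{\hat{p}\hat{q}}{n}} < p < \hat{p} + z_{\alpha/2}\sqrt{\frac{\hat{p}\hat{q}}{n}}.$$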
EXAMPLE 8−11
A recent study of 100 people in Fiji found 27 were obese. Find the 95% confidence interval of the population
proportion of all individuals living in Fiji who are obese.
SOLUTION
For the 95% confidence interval, $z_{\alpha/2}$ = 1.96. We have n = 100, p̂ = 27/100 = 0.27 and
q̂ = 1 − 0.27 = 0.73, so the 95% confidence interval for p is:
$$0.27 - 1.96\sqrt{\frac{(0.27)(0.73)}{100}} < p < 0.27 + 1.96\sqrt{\frac{(0.27)(0.73)}{100}}$$
$$0.183 < p < 0.357.$$
Hence, one can be 95% confident that the proportion of obese people in Fiji is between 18.3% and 35.7%.
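The same steps can be sketched in code. The function name `proportion_interval` is hypothetical and the critical values are the table values used in this chapter; the second call uses the counts from the survey of 120 female freshmen, of whom 18 did not wish to work after marriage, at 90% confidence.

```python
import math


def proportion_interval(x, n, z):
    """Return (lower, upper) for p_hat +/- z * sqrt(p_hat * q_hat / n)."""
    p_hat = x / n
    q_hat = 1 - p_hat
    margin = z * math.sqrt(p_hat * q_hat / n)
    return p_hat - margin, p_hat + margin


# 27 obese people out of 100, 95% confidence (z = 1.96)
lower, upper = proportion_interval(27, 100, 1.96)
# 18 out of 120, 90% confidence (z = 1.65)
lower_90, upper_90 = proportion_interval(18, 120, 1.65)
print(round(lower, 3), round(upper, 3))
print(round(lower_90, 3), round(upper_90, 3))
```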
EXAMPLE 8−12
A survey of 120 female freshmen showed that 18 did not wish to work after marriage. Find the 90%
confidence interval of the true proportion of females who do not work after marriage.
SOLUTION
For the 90% confidence interval, $z_{\alpha/2}$ = 1.65. We have n = 120, p̂ = 18/120 = 0.15 and q̂ = 0.85, so the
90% confidence interval for p is:
$$0.15 - 1.65\sqrt{\frac{(0.15)(0.85)}{120}} < p < 0.15 + 1.65\sqrt{\frac{(0.15)(0.85)}{120}}$$
$$0.096 < p < 0.204.$$
Hence, we can say with 90% confidence that between 9.6% and 20.4% of females do not work after
marriage.
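The minimum sample size needed to estimate a population proportion is:
$$n = \hat{p}\hat{q}\left(\frac{z_{\alpha/2}}{E}\right)^2$$
Where,
E is called the margin of error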
EXAMPLE 8−13
It is believed that 10% of Suva homes have a direct satellite television receiver (SKY Pacific). How large
a sample is necessary to estimate the true proportion of homes that do, with 90% confidence and within
3 percentage points?
SOLUTION
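For the 90% confidence interval, $z_{\alpha/2}$ = 1.65. Here p̂ = 0.1, q̂ = 0.9 and E = 0.03, hence
$$n = \hat{p}\hat{q}\left(\frac{z_{\alpha/2}}{E}\right)^2 = (0.1)(0.9)\left(\frac{1.65}{0.03}\right)^2 = 272.25.$$
Therefore, the minimum sample size should be 273.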
EXAMPLE 8−14
A researcher wishes to estimate the proportion of executives who own a car phone. She wants to be 99%
confident and be accurate within 5% of the true proportion. Find the minimum sample size necessary.
SOLUTION
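For the 99% confidence interval, $z_{\alpha/2}$ = 2.58. In this problem, we have no prior knowledge of p̂ and so we
assign p̂ = 0.5 and therefore q̂ = 0.5. Hence,
$$n = (0.5)(0.5)\left(\frac{2.58}{0.05}\right)^2 = 665.64.$$
Therefore, the minimum sample size should be 666.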
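Both sample-size calculations for a proportion can be checked with a few lines of code. This is a sketch under the chapter's own conventions: the name `sample_size_for_proportion` is an assumption, p̂ = 0.5 is used when no prior estimate is available, and the result is rounded up to a whole number.

```python
import math


def sample_size_for_proportion(p_hat, z, margin):
    """Return (raw value, rounded-up n) for n = p_hat * q_hat * (z / E) ** 2."""
    raw = p_hat * (1 - p_hat) * (z / margin) ** 2
    return raw, math.ceil(raw)


# Example 8-13: prior estimate 0.1, 90% confidence, E = 0.03
raw_tv, n_tv = sample_size_for_proportion(0.1, 1.65, 0.03)
# Example 8-14: no prior estimate (0.5), 99% confidence, E = 0.05
raw_phone, n_phone = sample_size_for_proportion(0.5, 2.58, 0.05)
print(n_tv, n_phone)
```

The choice p̂ = 0.5 maximises the product p̂q̂, so it gives the largest (most cautious) sample size for a given confidence level and margin of error.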
EXERCISES
2. A recent survey of 8 social networking sites has a mean of 13.1 and a standard deviation of 4.1
million visitors for a specific month. Find the 95% confidence interval of the true mean. Assume that
the variable is normally distributed.
3. If the variance of a national accounting exam is 900, how large a sample is needed to estimate the
true mean score within 5 points and with 99% confidence?
4. The number of unhealthy days based on the AQI (air quality index) for a random sample of
metropolitan areas is shown:
61 12 6 40 27 38 93 5 13 40
A. What is the point estimate of the mean number of unhealthy days for all such areas?
B. Construct a 98% confidence interval of μ based on the data.
5. A sample of 30 networking sites for a specific month has a mean of 26.1. Assume the population
standard deviation to be 4.2. Find the 99% confidence interval of the true mean.
6. A recent study indicated that 29% of the 100 women over age 55 in the study were widows. How
large a sample must you take to be 90% confident that the estimate is within 0.05 of the true
proportion of women over age 55 who are widows?
7. A Tongan advertising agency wishes to estimate the proportion of households that use a particular
brand of washing soap. They decide on a sample size of 500 and find that 157 households use the
product.
A. Construct a 99% confidence interval for the proportion.
B. How large would the sample have to be for their interval estimate of the proportion to be in
error by at most 2%?
8. In a survey of drug use among 995 Suva teenagers, the following results were reported. Estimate
with 90% confidence the proportion of all Suva teenagers who are daily smokers or occasional
smokers.
Objectives
After completing this chapter, you should be able to:
1. State the null and alternate hypothesis.
2. Test means when population standard deviation is known, using the z-test.
3. Test hypothesis using a p-value method.
9.1 Introduction
Researchers are interested in answering many types of questions. For example, a scientist might want
to know whether the earth is warming up. A physician might want to know whether a new medication will
lower a person’s blood pressure. An educator might wish to see whether the new method of teaching is
better than a traditional one. Automobile manufacturers are interested in determining whether seat belts
reduce the severity of injuries caused in accidents. These types of questions can be addressed through
statistical hypothesis testing, which is a decision-making process for evaluating claims about a
population parameter.
In this chapter, we will discuss the basic concepts of hypothesis testing. We will also discuss the
hypothesis testing procedure for the population mean using a z-test and three methods of hypothesis
testing: the traditional, P-value and confidence interval methods.
2. Alternative hypothesis
The alternative hypothesis, denoted by H1 , is a statistical hypothesis that states the existence of a
difference between a parameter and a specific value or states that there is a difference between two
parameters.
CASE 9−1
A medical researcher is interested in finding out whether a new medication will have any undesirable side
effects. The researcher is particularly concerned with the pulse rate of the patients who take the
medication. Will the pulse rate increase, decrease, or remain unchanged after a patient takes the
medication? In the past, the mean pulse rate is 82 beats per minute.
In this case, the researcher wants to study whether the mean pulse rate differs from 82 or not. Therefore, the
hypotheses for this situation are:
H0 : μ = 82
H1 : μ ≠ 82
This test is called a two-tailed test because there is a not-equal sign in the alternate hypothesis.
Note: While writing the hypotheses, you should remember the following:
1. If you are testing μ = μ0 or μ ≥ μ0 or μ ≤ μ0, it must go to the null hypothesis H0 and its
complement, i.e. μ ≠ μ0 or μ < μ0 or μ > μ0, respectively, will go to H1.
CASE 9−2
A chemist invents an additive to increase the life of an automobile battery. The mean lifetime of the
automobile battery is 36 months. In this case you are testing μ > 36, so it goes to H1 and its
complement, i.e. μ = 36 or μ ≤ 36, will go to H0. Therefore, the hypotheses for this situation are:
H0 : μ ≤ 36
H1 : μ > 36
This test is called right-tailed because there is a greater than sign in the alternate hypothesis.
A contractor wishes to lower heating bills by using a special type of insulation in houses. If the average
of the monthly heating bill is $78, her hypotheses about heating costs with the use of insulation are:
H0 : μ ≥ $78
H1 : μ < $78
This test is a left-tailed test because there is a less-than sign in the alternate hypothesis.
                   Decision on H0
Nature of H0       Reject H0            Accept H0
H0 is true         Type I error         Correct decision
H0 is false        Correct decision     Type II error
The above table shows that we are liable to commit the following two types of errors:
i. Rejection of the null hypothesis (H0) when it is true is called a type I error.
ii. Acceptance of the null hypothesis (H0) when it is false is called a type II error.
The probabilities of these two errors are denoted by α and β respectively:
α = P(Reject H0 | H0 is true).
β = P(Accept H0 | H0 is false).
The critical value, critical region and noncritical region of a two-tailed test, a right-tailed test and a
left-tailed test are shown in the following figures.
$$z = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}}, \quad \text{when } \sigma \text{ is known.}$$
EXAMPLE 9−1
A survey claims that the average cost of a hotel room in Fiji is $69.21. To test the claim, a researcher selects
a sample of 30 hotel rooms and finds that the average cost is $68.43. The standard deviation of the
population is $3.72. At α = 0.05, is there enough evidence to reject the claim?
SOLUTION
We need to test whether μ = $69.21 (claim), which should be stated in the null hypothesis.
[Figure: standard normal curve with critical regions to the left of −1.96 and to the right of 1.96, and the acceptance region between them.]
EXAMPLE 9−2
A researcher reports that the average salary of assistant professors (AP) is more than $42,000. A
sample of 30 AP has a mean salary of $43,260. At α = 0.05, test the claim that AP earn more than
$42,000/yr. It is known that σ = $5,230.
SOLUTION
We need to test here μ > $42,000 (claim), which should be stated in the alternative hypothesis.
[Figure: standard normal curve with the critical region to the right of 1.65 and the acceptance region to its left.]
EXAMPLE 9−3
A national magazine claims that the average college student watches less television than the general
public. The national average is 29.4 hours per week, with a standard deviation of 2 hours. A sample of
30 college students has a mean of 27 hours. Is there enough evidence to support the claim at α = 0.01?
SOLUTION
We need to test here μ < 29.4 (claim), which should be stated in the alternative hypothesis.
Step 1: State the hypothesis
H0 : μ ≥ 29.4
H1 : μ < 29.4 (claim)
[Figure: standard normal curve with the critical region to the left of −2.33.]
i. Traditional method,
ii. P –value method, and
iii. Confidence Interval method.
Calculating P-value
The P-value is obtained from the standard normal curve as follows:
For a left-tailed test, the P-value is the area to the left of the test value.
For a right-tailed test, the P-value is the area to the right of the test value.
For a two-tailed test, the P-value is twice the area in the tail beyond the test value (to the left of a negative test value, or to the right of a positive one).
EXAMPLE 9−4
A survey claims that the average cost of a hotel room in Fiji is $69.21. To test the claim, a researcher selects
a sample of 30 hotel rooms and finds that the average cost is $68.43. The standard deviation of the
population is $3.72. At α = 0.05, is there enough evidence to reject the claim? Use the P-value method.
SOLUTION
We need to test whether μ = $69.21 (claim), which should be stated in the null hypothesis.
Step 1: State the hypothesis
H0 : μ = $69.21 (claim)
H1 : μ ≠ $69.21
[Figure: standard normal curve with the test value z = −1.15 marked.]
Using the table, the area to the left of z = −1.15 is 0.1251. Since this is a two-tailed test, the P-value is
2(0.1251) = 0.2502.
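The test value and the two-tailed P-value for the hotel room data can be reproduced without tables. This is a minimal sketch, not the book's procedure: the helper names are hypothetical, and the standard normal area is computed with `math.erf` instead of being read from the Eton tables, so the P-value differs slightly from 0.2502 because z is not rounded to −1.15 first.

```python
import math


def normal_cdf(z):
    """Area under the standard normal curve to the left of z."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))


def z_test_value(sample_mean, mu0, sigma, n):
    """Test value z = (sample mean - mu0) / (sigma / sqrt(n)), sigma known."""
    return (sample_mean - mu0) / (sigma / math.sqrt(n))


z = z_test_value(68.43, 69.21, 3.72, 30)
p_value = 2 * normal_cdf(-abs(z))   # two-tailed test: double the tail area
reject_h0 = p_value < 0.05          # compare with alpha = 0.05
print(round(z, 2), round(p_value, 2), reject_h0)
```

Because the P-value is larger than α = 0.05, this sketch does not reject the null hypothesis for these data.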
EXAMPLE 9−5
A researcher wishes to test the claim that the average age of lifeguards in Ocean City is greater than 24
years. She selects a sample of 36 guards and finds the mean of the sample to be 24.7 years and the
population standard deviation is assumed to be 2 years. Is there evidence to support the claim at α =
0.05? Use the P-value method.
SOLUTION
We need to test whether μ > 24 (claim), which should be stated in the alternate hypothesis.
[Figure: standard normal curve with the area to the right of the test value z = 2.10 shaded.]
Using the table, the area to the right of z = 2.10 is 0.0179. Since this is a right-tailed test, the P-value is
0.0179.
Step 4: Make a decision to reject or not reject the null hypothesis. Since the P-value is less than 0.05,
the decision is “reject H0”.
EXAMPLE 9−6
Sugar is packed in 5 lb bags. An inspector suspects the bags may not contain 5 lbs. A sample of 50
bags produces a mean of 4.6 lbs; assume the population standard deviation is 0.7 lbs. Is there enough
evidence to conclude that the bags do not contain 5 lbs as stated, at α = 0.05? Use the confidence interval
method.
SOLUTION
9.5 Summary
This chapter introduces the concept of hypothesis testing. The concepts discussed in this chapter are as
follows: statistical hypothesis; null and alternate hypothesis; statistical test; type I and type II error; level
of significance; critical and non-critical region; z-test for mean; methods of hypothesis testing.
1. Define null hypothesis and alternate hypothesis, and give an example of each.
2. Write the null and alternative hypothesis for each of the following examples. Determine if each is a
case of a two-tailed, a left-tailed, or a right-tailed test.
A. To test if the mean amount of time spent per week watching sports on television by all adult men
is different from 9.5 hours.
B. To test if the mean amount of money spent by all customers at a supermarket is less than $105.
C. To test whether the mean starting salary of college graduates is higher than $39000 per year.
D. To test if the mean waiting time at the drive-through window at a fast food restaurant during rush
hour is at least 10 minutes.
3. The average 1-year old is 29 inches tall. A random sample of 30 1-year olds in a large day care
resulted in the following heights. At α = 0.05, can it be concluded that the average height differs from
29 inches? Assume σ = 2.61.
4. A researcher claims that adult dogs fed a special diet will have an average weight of 200 lbs. A
sample of 40 dogs has an average weight of 198.2 lbs and a standard deviation of 3.3 lbs.
A. At α = 0.05 can the claim be rejected? Use traditional method.
B. Also, find the 95% confidence interval of the true mean and verify the result in part A above.
5. A Pacific Tapioca manufacturer claims that the packets of tapioca chips they make have a mean
weight of 980g. The standard deviation of the weights is known to be 15g. A random sample of 150
packets has a mean weight of 985g. Does this result support the manufacturer claim? Use α = 0.1
and the P-value method to test this.
6. The average production of sugarcane in Fiji is 3000 pounds per acre. A new plant food has been
developed and is tested on 60 individual plots of land. The mean yield with the new plant food is 3120
pounds of sugarcane per acre, and the population standard deviation is 578 pounds. At α = 0.05,
can you conclude that the average production has increased?
HYPOTHESIS TESTING
(PART II)
Objectives
After completing this chapter, you should be able to:
1. Test means when population standard deviation is unknown, using the t-test.
2. Test proportions, using a z-test.
10.1 Introduction
In the previous chapter, we discussed the basic concepts of hypothesis testing, the z-test for testing
the population mean and the different methods to test hypotheses. In this chapter, we will discuss the
t-test for the population mean and the z-test for the population proportion.
However, if σ is unknown, for any sample size (if n < 30, the variable must be normally distributed), the
t-test is used to test the population mean.
Test Statistic
The value of the test statistic is obtained by:
$$t = \frac{\bar{X} - \mu}{s / \sqrt{n}}, \quad \text{when } \sigma \text{ is not known.}$$
EXAMPLE 10−1
A job placement director claims that the average starting salary for nurses is $24,000. A sample of 10
nurses has a mean of $23,450 and a standard deviation of $400. Is there enough evidence to reject the
director’s claim at α = 0.05? Assume the variable is normally distributed.
SOLUTION
We need to test here μ = 24000 (claim), which should be stated in the null hypothesis.
[Figure: t-distribution curve for d.f. = 10 − 1 = 9 with critical regions to the left of −2.262 and to the right of 2.262, and the acceptance region between them.]
EXAMPLE 10−2
An MP claims that the average number of acres in his province’s State Parks is less than 2000 acres. A
random sample of five parks is selected and the number of acres is shown below. Assume the variable
is normally distributed.
SOLUTION
We need to test here μ < 2000 (claim), which should be stated in the alternate hypothesis.
[Figure: t-distribution curve with the critical region to the left of −3.747 and the acceptance region to its right.]
Test statistic
If a large sample of size n is drawn for testing a population proportion, the value of the test statistic (z-test)
is given by:
$$z = \frac{\hat{p} - p}{\sqrt{pq/n}}$$
Where,
p̂ = X/n (sample proportion)
p = population proportion
n = sample size
An educator estimates that the dropout rate for seniors at high schools in a particular city is 15%. Last year,
38 seniors from a random sample of 200 seniors withdrew. At α = 0.05, is there enough evidence to
reject the educator’s claim? Use the traditional method.
SOLUTION
We need to test here p = 0.15 (claim), which should be stated in the null hypothesis.
Step 1: State the hypothesis
H0 : p = 0.15 (claim)
H1 : p ≠ 0.15
Step 2: Find the critical value.
Since α = 0.05 and the test is two-tailed, the area in each of the left and right tails is 0.025. Draw a
standard normal curve and find the z-values using the tables from the Eton tables. The critical values are
z = ±1.96.
[Figure: standard normal curve with critical regions to the left of −1.96 and to the right of 1.96, and the acceptance region between them.]
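The remaining steps of the traditional method (computing the test value and comparing it with the critical values) can be sketched as follows. The function name `proportion_z` is hypothetical and the critical value 1.96 is the table value found in Step 2; this is an illustration of the test-statistic formula for proportions, not a reproduction of the book's worked solution.

```python
import math


def proportion_z(x, n, p0):
    """Test value z = (p_hat - p0) / sqrt(p0 * q0 / n) for a claimed proportion p0."""
    p_hat = x / n
    q0 = 1 - p0
    return (p_hat - p0) / math.sqrt(p0 * q0 / n)


# 38 dropouts in a sample of 200 seniors, claimed rate 15%
z = proportion_z(38, 200, 0.15)
reject_h0 = abs(z) > 1.96   # two-tailed test at alpha = 0.05
print(round(z, 2), reject_h0)
```

Here the test value falls between −1.96 and 1.96, so the sketch does not reject the claim that the dropout rate is 15%.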
A recent study found that, at most, 32% of people who have been in a plane crash have died. In a sample
of 100 people who were in a plane crash, 38 died. Should the study’s claim be rejected? Use α = 0.05.
Use the traditional method.
SOLUTION
We need to test here p ≤ 0.32 (claim), which should be stated in the null hypothesis.
Step 1: State the hypothesis
H0 : p ≤ 0.32 (claim)
H1 : p > 0.32
[Figure: standard normal curve with the critical region to the right of 1.65 and the acceptance region to its left.]
At a large university, a study found that no more than 25% of the students who commute travel more than
14 miles to campus. At α = 0.10, test the finding if, in a sample of 100 students, 30 drove more than
14 miles. Use the P-value method.
SOLUTION
[Figure: standard normal curve with the area to the right of the test value z = 1.15 shaded.]
Using the table, the area to the right of z = 1.15 is 0.1251. Since this is a right-tailed test, the P-value is
0.1251.
10.4 Summary
This chapter discusses the t-test for mean and the z-test for population proportion.
1. An attorney claims that more than 25% of all lawyers advertise. A sample of 200 lawyers in a certain
city showed that 63 had used some form of advertising. At α = 0.05, is there enough evidence to
support the attorney’s claim? Use the P-value method.
2. A recent survey found that 68% of the population own their homes. In a random sample of 150
heads of households, 92 responded that they owned their homes. At the α = 0.01 level of significance,
does that suggest a difference from the national proportion? Use the traditional method.
3. The average family size was reported as 3.18. A random sample of families in a particular school
district resulted in the following family sizes:
5 4 5 4 4 3 6 4 3 3
5 6 3 3 2 7 4 5 2 2
3 5 2 2
At α = 0.05, does the average family size differ from the national average? To test the claim:
A. Use a confidence interval method.
B. Use a traditional method.
4. A researcher in Vanuatu claims that a factory worker in Vanuatu earns an average of $700 per week.
A sample of 400 factory workers in Vanuatu showed that they earn an average of $685 per week
with a standard deviation of $125. Using α = 0.01, can you conclude that there is evidence to support
the researcher’s claim? Use the confidence interval method.
5. A food company is planning to market a new type of frozen yogurt. However, before marketing this
yogurt the company wants to find what percentage of the people like it. The company’s management
has decided it will market this yogurt only if at least 35% of the people like it. The company’s research
department selected a random sample of 400 persons and asked them to taste this yogurt. Of these
400 persons, 112 said they liked it. Testing at the 2.5% significance level, can you conclude that the
company should market this yogurt? Use traditional method.
Objectives
After completing this chapter, you should be able to:
1. Test the difference between two sample means, using the z-test.
2. Test the difference between two means for independent samples, using the t-test.
11.1 Introduction
The basic concepts of hypothesis testing were explained in Chapter 9. With the z and t tests, a sample
mean or proportion can be compared to a specific population mean or proportion.
There are, however, many instances when the researchers wish to compare two sample means, using
experiments and control groups. For example, the average lifetimes of two different brands of bus tires
might be compared to see whether there is any difference in the tread wear. Two different brands of
fertilizer might be tested to see whether one is better than the other for growing plants.
In comparing the means, the same basic steps for hypothesis testing are used and z and t-tests are
also used. When comparing two means by using t-test, the researcher must decide whether the samples
are independent or dependent.
To test the difference between two means we have to know whether the two samples drawn from the
populations are dependent or independent, whether they are large or small, and whether the population
standard deviations are known or unknown.
11.2.2 Hypothesis
If we wish to decide whether the means of the populations from which two independent samples were
selected are really different or the same, then the null hypothesis is H0 : μ1 = μ2 (i.e. the means are not
different) and the alternative hypothesis could be any one of the following:
Test Statistic
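The value of the test statistic if σ1 and σ2 are known:
$$z = \frac{(\bar{X}_1 - \bar{X}_2) - (\mu_1 - \mu_2)}{\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}}.$$
If σ1 and σ2 are known but the sample sizes are small (populations normally distributed), the value of the test
statistic will be the same.
The confidence interval for the difference between the two means is:
$$(\bar{X}_1 - \bar{X}_2) - z_{\alpha/2}\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}} < \mu_1 - \mu_2 < (\bar{X}_1 - \bar{X}_2) + z_{\alpha/2}\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}.$$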
A survey found that the average hotel room rate in FJ is $88.42 and the average room rate in NZ is
$80.61. Assume that the data were obtained from two samples of 50 hotels each and that the population
standard deviations were $5.62 and $4.83 respectively. At α = 0.05, can it be concluded that there is no
significant difference in the rates?
SOLUTION
[Figure: standard normal curve with critical values −1.96 and 1.96.]
$$z = \frac{(\bar{X}_1 - \bar{X}_2) - (\mu_1 - \mu_2)}{\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}} = \frac{(88.42 - 80.61) - 0}{\sqrt{\dfrac{5.62^2}{50} + \dfrac{4.83^2}{50}}} = 7.45$$
SOLUTION
A. The P-value is approximately equal to 0. Since the P-value is less than 0.05, we reject the null hypothesis.
B. Since α = 0.05, we have to construct the 95% confidence interval for μ1 − μ2. Substituting into the formula
one gets:
$$(88.42 - 80.61) - 1.96\sqrt{\frac{5.62^2}{50} + \frac{4.83^2}{50}} < \mu_1 - \mu_2 < (88.42 - 80.61) + 1.96\sqrt{\frac{5.62^2}{50} + \frac{4.83^2}{50}}$$
$$5.76 < \mu_1 - \mu_2 < 9.86.$$
Since the confidence interval does not contain zero, one would reject the null hypothesis.
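The test value and the confidence interval for the hotel rate comparison can be computed together, since both use the same standard error. The sketch below is illustrative: the function name `two_sample_z` is an assumption, and the critical value 1.96 is the table value for 95% confidence.

```python
import math


def two_sample_z(mean1, mean2, sigma1, sigma2, n1, n2, z_crit):
    """Return (z, lower, upper) for two independent samples with known sigmas.

    z is the test value under H0: mu1 - mu2 = 0; (lower, upper) is the
    confidence interval for mu1 - mu2 using the critical value z_crit.
    """
    diff = mean1 - mean2
    std_error = math.sqrt(sigma1 ** 2 / n1 + sigma2 ** 2 / n2)
    z = diff / std_error
    return z, diff - z_crit * std_error, diff + z_crit * std_error


z, lower, upper = two_sample_z(88.42, 80.61, 5.62, 4.83, 50, 50, 1.96)
print(round(z, 2), round(lower, 2), round(upper, 2))
```

An interval that excludes zero and a test value beyond ±1.96 lead to the same decision, which is why the confidence interval method can be used to verify the traditional method.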
EXAMPLE 11−3
The data shown are the rental fees (in dollars) for two random samples of apartments in a large city. At
α = 0.10, can it be concluded that the average rental fee for apartments in the east is greater than
the average rental fee in the west? Assume σ1 = 119 and σ2 = 103.
East West
495 390 540 445 420 525 400 310 375 750
410 550 499 500 550 390 795 554 450 370
389 350 450 530 350 385 395 425 500 550
375 690 325 350 799 380 400 450 365 425
475 295 350 485 625 375 360 425 400 475
275 450 440 425 675 400 475 430 410 450
625 390 485 550 650 425 450 620 500 400
685 385 450 550 425 295 350 300 360 400
SOLUTION
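[Figure: standard normal curve with the critical region to the right of 1.28 and the acceptance region to its left.]
$$z = \frac{(\bar{X}_1 - \bar{X}_2) - (\mu_1 - \mu_2)}{\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}} = \frac{(477.43 - 437.35) - 0}{\sqrt{\dfrac{119^2}{40} + \dfrac{103^2}{40}}} = 1.61$$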
Test Statistic
The value of test statistic is:
$$t = \frac{(\bar{X}_1 - \bar{X}_2) - (\mu_1 - \mu_2)}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}}$$
EXAMPLE 11−4
The average size of a farm in Ba is 191 acres. The average size of a farm in Nadi is 199 acres. Assume
the data were obtained from two samples with standard deviations of 32 and 12 acres, respectively and
sample sizes 8 and 10, respectively. Can it be concluded at α = 0.05 that the average size of the farm
in the two districts in Fiji is different? Assume the populations are normally distributed.
SOLUTION
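[Figure: t-distribution curve with critical regions to the left of −2.365 and to the right of 2.365, and the acceptance region between them.]
$$t = \frac{(\bar{X}_1 - \bar{X}_2) - (\mu_1 - \mu_2)}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}} = \frac{(191 - 199) - 0}{\sqrt{\dfrac{32^2}{8} + \dfrac{12^2}{10}}} = -0.67.$$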
EXAMPLE 11−5
The mean age of a sample of 25 people who were playing soccer is 48.7 years, and the standard deviation
is 6.8 years. The mean age of a sample of 35 people who were playing rugby is 55.3 years, with a standard
deviation of 3.2 years. Can it be concluded at α = 0.05 that the mean age of those playing soccer is less
than that of those playing rugby? Assume the populations are normally distributed.
SOLUTION
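[Figure: t-distribution curve with the critical region to the left of −1.711 and the acceptance region to its right.]
$$t = \frac{(\bar{X}_1 - \bar{X}_2) - (\mu_1 - \mu_2)}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}} = \frac{(48.7 - 55.3) - 0}{\sqrt{\dfrac{6.8^2}{25} + \dfrac{3.2^2}{35}}} = -4.509.$$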
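The two t test values above can be recomputed with a short function. This is a sketch under a stated assumption: the critical values quoted in the examples (2.365, which corresponds to 7 degrees of freedom, and 1.711, which corresponds to 24) suggest that the degrees of freedom are taken as the smaller sample size minus one, so the hypothetical helper `two_sample_t` returns that value alongside the test value.

```python
import math


def two_sample_t(mean1, mean2, s1, s2, n1, n2):
    """Return (t, df) for two independent samples with unknown, unequal sigmas.

    t is the test value under H0: mu1 - mu2 = 0. df is assumed to be the
    smaller of n1 - 1 and n2 - 1, matching the critical values in the examples.
    """
    std_error = math.sqrt(s1 ** 2 / n1 + s2 ** 2 / n2)
    t = (mean1 - mean2) / std_error
    df = min(n1, n2) - 1
    return t, df


# Example 11-4: farm sizes in Ba and Nadi (two-tailed, critical values +/-2.365)
t_farm, df_farm = two_sample_t(191, 199, 32, 12, 8, 10)
# Example 11-5: ages of soccer and rugby players (left-tailed, critical value -1.711)
t_age, df_age = two_sample_t(48.7, 55.3, 6.8, 3.2, 25, 35)
print(round(t_farm, 2), df_farm, round(t_age, 3), df_age)
```

With these values, the farm-size test value lies inside the acceptance region, while the age test value lies far inside the left-tail critical region.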
EXERCISES
1. A researcher claims that the average yearly earnings of male college graduates (with at least a
bachelor’s degree) is different from the average yearly earnings of female college graduates with the
same qualifications. Based on the results below, can it be concluded that there is difference in mean
earnings between male and female college graduates? Use the 0.01 level of significance.
Male Female
Sample mean $59,235 $52,487
Population standard deviation $8,945 $10,125
Sample size 40 35
2. A researcher wishes to see if there is a difference in the cholesterol levels of two groups of men. A
random sample of 30 men between the ages of 25 and 40 is selected and tested; the average
cholesterol level was 223 with standard deviation of 6.1. A second sample of 25 men between ages
of 41 and 56 is selected and tested; the average cholesterol level for this group was 229 with standard
deviation of 5.8. Assume the populations are normally distributed and the population standard
deviations are unequal. At α = 0.01, is there a difference in the cholesterol levels between the two
groups? Use traditional method.
3. The mean height of 20 male athletes in Fiji was 68.2 inches, while 20 male non-athletes in Fiji had
a mean height of 67.5 inches. The population standard deviations were 2.5 inches and 2.8
inches, respectively. Assume the populations are normally distributed. Test the hypothesis that
athletes are taller than non-athletes at the 5% level of significance, using:
A. P-value method.
B. Verify the solution in Part A using confidence interval method.
4. A sample of 35 chemists from Lautoka city shows an average salary of $39,420 with a standard
deviation of $1659, while a sample of 40 chemists from Suva city has an average salary of $30,215
with a standard deviation of $4116. Is there a significant difference between the chemists’ salaries in
the two cities at α = 0.02?
5. A researcher claims that the mean of the salaries of primary school teachers is greater than the mean
of the salaries of secondary school teachers in Fiji. The mean of the salaries of a sample of 26 primary
school teachers is $48,256, and the sample standard deviation is $3,912.40. The mean of the salaries
of a sample of 24 secondary school teachers is $45,633, and the sample standard deviation is
$5,533. Assume the populations are normally distributed and the population standard deviations are
unequal. At α = 0.05, can it be concluded that the mean of the salaries of the primary school teachers
is greater than the mean of the salaries of the secondary school teachers?
CORRELATION AND
REGRESSION
Objectives
After completing this chapter, you should be able to:
1. Draw a scatter plot.
2. Compute the correlation coefficient.
3. Test the correlation coefficient.
4. Compute the equation of the regression line.
5. Use the concept of multiple regression.
12.1 Introduction
Another area of inferential statistics involves determining whether a relationship between two or more
quantitative (numerical) variables exists. For example, an educator may want to know whether there is
any relationship between the number of absences and the student’s final grade for a student in her class.
A scientist would be interested in knowing whether there is any relationship between age and blood
pressure of a person.
This chapter considers the relationship between two variables, which can be studied by the correlation
and the regression analysis. Correlation measures how strongly two variables are related and on the
other hand, by regression analysis a model using these two variables is fitted which helps to predict a
value of a variable when the value of other variable is known. For example, correlation can be used by
an economist to find out how strongly income and expenditure of a household are related and regression
can fit a model to predict the expenditure of a house hold for a given income.
There are two types of regression: simple and multiple. In simple regression, there are two variables; an
independent variable, also called explanatory variable or a predictor variable, and a dependent variable,
also called a response variable. In simple regression, the independent variable is used to predict the
dependent variable. In multiple regression, two or more independent variables exist with only one
dependent variable.
12.2 Correlation
If a change in one variable is accompanied by a change in the other variable, then the variables are said to be
correlated and the association between the two variables is known as correlation. In simple correlation
and regression studies, the researcher collects data on two quantitative variables to see whether a
relationship exists between the variables. For example, if a researcher wishes to see whether there is
a relationship between the age and blood pressure of a person, he must select a random sample of
people and record their age and their blood pressure. A table can be made as shown below.
The two variables for this study are called the independent and dependent variables. The independent
variable is the one that can be controlled or manipulated. In this case, the age of a person is the
independent variable and is denoted as x . The dependent variable is the one that cannot be controlled
or manipulated and in this case the blood pressure of a person is the dependent variable and is denoted
as y.
Positive correlation
If the variables change in the same direction, i.e. an increase (or decrease) in one variable is
accompanied by an increase (or decrease) in the other variable, then the variables are positively
correlated. For example, (i) height and weight of persons and (ii) income and expenditure of households
are positively correlated.
Negative correlation
If the variables change in opposite directions, i.e. an increase (or decrease) in one variable is
accompanied by a decrease (or increase) in the other, then the variables are negatively correlated.
For example, (i) price and demand of commodities and (ii) number of absences and final exam mark are
negatively correlated.
No correlation
If two variables are independent of each other and not related in any fashion, then there cannot be any
correlation between the variables. For example, the correlation between each of the following pairs
should be zero:
height and income of individuals,
marriage rate and agricultural production rate in a country, and
shoe size and intelligence of a group of individuals.
[Figure: two scatter plots of y against x.]
If the points seem to form a pattern with a downward slope, then the variables are said to be
negatively correlated.
[Figure: two scatter plots of y against x.]
If the points do not form any pattern with downward or upward slope, then the variables are said to be
uncorrelated.
Construct a scatter plot for the data obtained in a study of age and systolic blood pressure of six
randomly selected subjects.
SOLUTION
[Figure: scatter plot of blood pressure (vertical axis, 100 to 160) against age (horizontal axis, 30 to 80).]
The above scatter diagram indicates that there is a positive correlation between the age and the blood
pressure.
EXAMPLE 12−2
Marks of eight students who sat an examination in English and Mathematics are given by
Maths ( x) 35 35 40 45 50 50 60 69
English ( y ) 50 40 30 65 35 50 50 40
The above scatter diagram indicates that there is no correlation between the variables.
$$ r = \frac{n\sum xy - \left(\sum x\right)\left(\sum y\right)}{\sqrt{\left[n\sum x^2 - \left(\sum x\right)^2\right]\left[n\sum y^2 - \left(\sum y\right)^2\right]}}, $$
where
$n$ is the number of data pairs.
Note:
The value of $r$ is always between –1 and +1, that is, $-1 \le r \le 1$.
If $r$ is close to 1, there is a strong positive relationship.
If $r$ is close to –1, there is a strong negative relationship.
If $r$ is close to 0, there is little or no relationship. See the diagram below.
SOLUTION
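x      y      xy       x²      y²
43     128    5504     1849    16384
48     120    5760     2304    14400
56     135    7560     3136    18225
61     143    8723     3721    20449
67     141    9447     4489    19881
70     152    10640    4900    23104
$\sum x = 345$, $\sum y = 819$, $\sum xy = 47634$, $\sum x^2 = 20399$, $\sum y^2 = 112443$
With n = 6,
$$ r = \frac{6(47634) - (345)(819)}{\sqrt{\left[6(20399) - 345^2\right]\left[6(112443) - 819^2\right]}} = \frac{3249}{\sqrt{(3369)(3897)}} = 0.897. $$
This shows there is a strong positive linear correlation between the two variables, age and blood
pressure.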
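The hand computation above can be checked with a short script. The following is a minimal sketch in Python (standard library only); the function name `pearson_r` is an illustrative choice, and the data are the six age and blood pressure pairs from the table.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Correlation coefficient r from the computational (sums) formula."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator

# Age (x) and systolic blood pressure (y) of the six subjects.
age = [43, 48, 56, 61, 67, 70]
pressure = [128, 120, 135, 143, 141, 152]

r = pearson_r(age, pressure)
print(round(r, 3))  # 0.897
```

The function mirrors the columns of the table: each sum in the code corresponds to one column total.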
EXAMPLE 12−4
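No. of absences, x     6   2   15  9   12  5   8
Final exam mark, y     82  86  43  74  58  90  78
SOLUTION
x      y      xy      x²     y²
6      82     492     36     6724
2      86     172     4      7396
15     43     645     225    1849
9      74     666     81     5476
12     58     696     144    3364
5      90     450     25     8100
8      78     624     64     6084
$\sum x = 57$, $\sum y = 511$, $\sum xy = 3745$, $\sum x^2 = 579$, $\sum y^2 = 38993$
With n = 7,
$$ r = \frac{7(3745) - (57)(511)}{\sqrt{\left[7(579) - 57^2\right]\left[7(38993) - 511^2\right]}} = \frac{-2912}{\sqrt{(804)(11830)}} = -0.944. $$
This shows there is a negative linear correlation between the two variables, number of absences and final
exam mark of a student.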
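Hypotheses
$H_0: \rho = 0$ (there is no correlation between the variables in the population)
$H_1: \rho \ne 0$ (there is a significant correlation between the variables in the population)
Test Statistic
If both variables are normally distributed, then the value of the test statistic for testing $H_0: \rho = 0$
is calculated by:
$$ t = r\sqrt{\frac{n-2}{1-r^2}}, $$
with $n - 2$ degrees of freedom.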
EXAMPLE 12−5
Test the significance of the correlation coefficient for the age and blood pressure data.
SOLUTION
In Example 12−3, we obtained r = 0.897. This shows there is a strong positive linear correlation between
age and blood pressure in the sample data. To conclude the same for the population we have to carry out
hypothesis testing.
Hypotheses
$H_0: \rho = 0$
$H_1: \rho \ne 0$
Critical value
Since the value of alpha is not given, we use $\alpha = 0.05$ and d.f. $= 6 - 2 = 4$. Looking at the t-distribution
table from the Eton Tables with $\nu = 4$ and $2p = 0.05$ (two-tailed test), we have the critical values $t_{\alpha/2} = \pm 2.776$.
Test Statistic:
$$ t = 0.897\sqrt{\frac{6-2}{1-(0.897)^2}} = 4.059. $$
Conclusion: Since the test value t = 4.059 is in the critical region, $H_0$ is rejected at the 5% level of
significance. Hence, there is a significant correlation between age and blood pressure.
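The test in this example can be sketched in a few lines of Python. The function name `correlation_t` is an illustrative choice, and the critical value 2.776 is simply copied from the table lookup above rather than computed.

```python
from math import sqrt

def correlation_t(r, n):
    """Test statistic t = r * sqrt((n - 2) / (1 - r^2)) for H0: rho = 0."""
    return r * sqrt((n - 2) / (1 - r ** 2))

r = 0.897          # sample correlation from the age and blood pressure data
n = 6              # number of data pairs
critical = 2.776   # two-tailed critical value, alpha = 0.05, d.f. = n - 2 = 4 (from the table)

t = correlation_t(r, n)
reject_h0 = abs(t) > critical   # two-tailed test: compare |t| with the critical value

print(round(t, 3), reject_h0)
```

Comparing the absolute value of t with the critical value handles positive and negative correlations with the same rule.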
EXAMPLE 12−6
Test the significance of the correlation coefficient for the number of absences and final exam mark data,
using 𝛼 = 0.01.
SOLUTION
In Example 12−4, we obtained $r = -0.944$. This shows there is a negative linear correlation between the
variables in the sample data. To conclude the same for the population we have to carry out hypothesis
testing.
Hypotheses
$H_0: \rho = 0$
$H_1: \rho \ne 0$
Critical value:
Since $\alpha = 0.01$ and d.f. $= 7 - 2 = 5$, looking at the t-distribution table from the Eton Tables with $\nu = 5$
and $2p = 0.01$ (two-tailed test), we have the critical values $t_{\alpha/2} = \pm 4.032$.
Test Statistic:
$$ t = -0.944\sqrt{\frac{7-2}{1-(-0.944)^2}} = -6.398. $$
Conclusion: Since the test value t = -6.398 is in the critical region, $H_0$ is rejected at the 1% level of
significance. Hence, there is a significant correlation between the variables.
EXAMPLE 12−7
A manager wishes to find out whether there is a relationship between the age of employees and the
number of sick days they take each year. The manager randomly selects a sample of 6 of his employees
and the data are as follows:
Age, x     18  26  39  48  53  58
Days, y    16  12  9   5   6   2
Test whether the correlation between the age of employees and the number of sick days is
significant at the 5% level of significance.
SOLUTION
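We have
$$ r = \frac{6(1625) - (242)(50)}{\sqrt{\left[6(10998) - 242^2\right]\left[6(546) - 50^2\right]}} = -0.979. $$
Hypotheses
$H_0: \rho = 0$
$H_1: \rho \ne 0$
Critical value:
Since $\alpha = 0.05$ and d.f. $= 6 - 2 = 4$, the critical values are $t_{\alpha/2} = \pm 2.776$.
Test Statistic:
$$ t = -0.979\sqrt{\frac{6-2}{1-(-0.979)^2}} = -9.604. $$
Conclusion: Since the test value t = -9.604 is in the critical region, $H_0$ is rejected at the 5% level of
significance. There is a significant relationship between a person's age and the number of sick days that
a person takes each year.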
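The equation of the regression line is $y' = a + bx$, with
$$ a = \frac{\left(\sum y\right)\left(\sum x^2\right) - \left(\sum x\right)\left(\sum xy\right)}{n\sum x^2 - \left(\sum x\right)^2}, \qquad b = \frac{n\sum xy - \left(\sum x\right)\left(\sum y\right)}{n\sum x^2 - \left(\sum x\right)^2}, $$
where
$a$ is called the intercept and
$b$ is the slope of the regression line.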
EXAMPLE 12−8
Find the equation of the regression line for the data in Example 12−1. Use the regression line to predict
the blood pressure of a person who is 50 years old.
SOLUTION
$$ a = \frac{(819)(20399) - (345)(47634)}{6(20399) - 345^2} = 81.048 \quad\text{and}\quad b = \frac{6(47634) - (345)(819)}{6(20399) - 345^2} = 0.964. $$
Hence the equation of the regression line is $y' = 81.048 + 0.964x$. The predicted blood pressure of a
person who is 50 years old is $y' = 81.048 + 0.964(50) \approx 129$.
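As a check on the arithmetic, the intercept and slope can be computed directly from the data. This is a minimal Python sketch using the same sums formulas; the function name `regression_line` is an illustrative choice.

```python
def regression_line(xs, ys):
    """Return (a, b) for the least-squares line y' = a + b*x."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    denominator = n * sum_x2 - sum_x ** 2      # shared by both formulas
    a = (sum_y * sum_x2 - sum_x * sum_xy) / denominator
    b = (n * sum_xy - sum_x * sum_y) / denominator
    return a, b

age = [43, 48, 56, 61, 67, 70]
pressure = [128, 120, 135, 143, 141, 152]

a, b = regression_line(age, pressure)
predicted = a + b * 50                          # predicted blood pressure at age 50

print(round(a, 3), round(b, 3), round(predicted))
```

Because the script keeps full precision for a and b, its prediction can differ in the decimals from a hand calculation that uses rounded coefficients; both round to 129 here.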
EXAMPLE 12−9
For the data in Example 12-7, find the equation of the regression line. Also, predict y when the age (x)
of an employee is 47 years.
SOLUTION
$$ a = \frac{(50)(10998) - (242)(1625)}{6(10998) - 242^2} = 21.100 \quad\text{and}\quad b = \frac{6(1625) - (242)(50)}{6(10998) - 242^2} = -0.317. $$
Hence the equation of the regression line is $y' = 21.100 - 0.317x$. The predicted number of sick days for an
employee who is 47 years old is $y' = 21.100 - 0.317(47) \approx 6.2$, that is, about 6 days.
Coefficient of Determination
We now know how to construct a linear regression model, but:
How good is the regression model?
How well does the independent variable explain the dependent variable in the regression model?
The coefficient of determination is one concept that answers these questions. The square of the correlation
coefficient is known as the coefficient of determination, that is:
$$ \text{coefficient of determination} = r^2. $$
It gives the proportion of the total variation in $y$ that is explained (accounted for) by the regression model.
If $r^2$ is very close to 1, then the model predicts $y$ well.
EXAMPLE 12−10
The following data represent trends in cigarette consumption (x) per capita and lung cancer
mortality rate (y) in a county.
Consumption (x) 11.8 12.5 15.7 19.2 21.9 23.3
Mortality rate (y) 10.4 16.5 22.9 26.6 33.8 42.8
SOLUTION
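x       y       xy        x²        y²
11.8    10.4    122.72    139.24    108.16
12.5    16.5    206.25    156.25    272.25
15.7    22.9    359.53    246.49    524.41
19.2    26.6    510.72    368.64    707.56
21.9    33.8    740.22    479.61    1142.44
23.3    42.8    997.24    542.89    1831.84
$\sum x = 104.4$, $\sum y = 153$, $\sum xy = 2936.68$, $\sum x^2 = 1933.12$, $\sum y^2 = 4586.66$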
Hypotheses
Critical value:
Since $\alpha = 0.05$ and d.f. = 4, the critical values are $t_{\alpha/2} = \pm 2.776$.
Test Statistic:
$$ t = 0.971\sqrt{\frac{6-2}{1-(0.971)^2}} = 8.12. $$
Conclusion: Since the test value lies in the critical region, H0 is rejected at 5% level of significance.
Hence, we may conclude that the correlation between the cigarette consumption per capita and lung
cancer mortality rate is significant.
C. We have
$$ a = \frac{(153)(1933.12) - (104.4)(2936.68)}{6(1933.12) - (104.4)^2} = -15.4742 \quad\text{and}\quad b = \frac{6(2936.68) - (104.4)(153)}{6(1933.12) - (104.4)^2} = 2.3548. $$
Hence the equation of the regression line is $y' = -15.4742 + 2.3548x$.
D. When the cigarette consumption is 18.5, the predicted mortality rate is $y' = -15.4742 + 2.3548(18.5) = 28.09$.
E. Coefficient of determination $= r^2 = (0.971)^2 = 0.943$. This means that 94.3% of the total variation
is explained by the linear regression model.
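The interpretation of $r^2$ as "proportion of variation explained" can be verified numerically. The sketch below, in Python, fits the regression line to the cigarette data and computes $1 - SSE/SST$, where SSE is the sum of squared residuals and SST is the total sum of squares about the mean of y. These two names and the function name `r_squared` are introduced here for illustration; for a simple linear regression this ratio equals the square of the correlation coefficient.

```python
def r_squared(xs, ys):
    """Proportion of the variation in y explained by the least-squares line."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx                  # slope
    a = mean_y - b * mean_x        # intercept
    sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))  # unexplained variation
    sst = sum((y - mean_y) ** 2 for y in ys)                   # total variation
    return 1 - sse / sst

consumption = [11.8, 12.5, 15.7, 19.2, 21.9, 23.3]
mortality = [10.4, 16.5, 22.9, 26.6, 33.8, 42.8]

print(round(r_squared(consumption, mortality), 3))  # 0.943
```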
In multiple linear regression there are $k$ independent variables $x_1, x_2, \ldots, x_k$ and one dependent variable
$y'$, and the regression equation is given by:
$$ y' = a + b_1 x_1 + b_2 x_2 + \cdots + b_k x_k. $$
A multiple correlation coefficient $R$ can also be computed to determine if a significant relationship exists
between the independent variables and the dependent variable. Since the computations in multiple
regression are quite complicated and for the most part would be done on a computer, we will only
consider examples with 2 independent variables and one dependent variable.
EXAMPLE 12−11
A lecturer at USP wishes to see whether a student's grade point average and age are related to the
student's score in the final exam. He selects five students and obtains the following data.
We will use Excel for this problem, please follow the steps below:
1. Enter the data in three separate columns of a new worksheet.
2. Select Data tab on the tool bar, then Data Analysis >Regression.
Regression Statistics
Multiple R           0.984382
R Square             0.969007
Adjusted R Square    0.938014
Standard Error       3.40005
Observations         5
1. The multiple correlation coefficient is $R = 0.984382$, which indicates that there is a strong
relationship between a student's GPA and age and the final exam score.
Note: The multiple correlation coefficient R can range from 0 to +1; it can never be negative. If
it is closer to +1, the relationship is strong and if closer to 0, the relationship is weak.
3. To test the correlation coefficient, we can use the P-value given in the output (Significance F),
which is 0.030993. Since the P-value is less than $\alpha = 0.05$, we reject the null hypothesis and
conclude that there is a significant relationship between a student's GPA and age and the final
exam score.
4. The multiple regression equation obtained is: $y' = -39.8114 + 18.18575x_1 + 2.777876x_2$.
5. If a student has a GPA of 3.0 and is 25 years old, her predicted final exam score is
$y' = -39.8114 + 18.18575(3.0) + 2.777876(25) \approx 84$.
EXAMPLE 12−12
A study was conducted, and a significant relationship was found among the number of hours a teenager
watches television per day $x_1$, the number of hours the teenager talks on the telephone per day $x_2$, and the
teenager's weight $y$. The regression equation is $y' = 98.7 + 3.82x_1 + 6.51x_2$. Predict a teenager's
weight if she averages 3 hours of TV and 1.5 hours on the phone a day.
SOLUTION
Using the regression equation, we have $y' = 98.7 + 3.82(3) + 6.51(1.5) = 119.925$. The teenager's
predicted weight is about 119.93 kg if she averages 3 hours of TV and 1.5 hours on the phone per day.
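Substituting values into a multiple regression equation is easy to script. The following Python sketch assumes the regression equation of this example has intercept 98.7 and positive coefficients 3.82 and 6.51; the function name `predict` is an illustrative choice.

```python
def predict(intercept, coefficients, values):
    """Evaluate y' = a + b1*x1 + b2*x2 + ... + bk*xk."""
    return intercept + sum(b * x for b, x in zip(coefficients, values))

# x1 = hours of TV per day, x2 = hours on the phone per day
weight = predict(98.7, [3.82, 6.51], [3, 1.5])
print(round(weight, 3))
```

The same function works for any number of independent variables, as long as the coefficients and the values are listed in the same order.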
EXERCISES
1. Explain the similarities and differences between simple linear regression and multiple regression.
2. Recent agricultural data in Fiji showed the number of eggs produced and the price received per
dozen for a given year.
$\sum x = 5709$, $\sum y = 5.236$, $\sum x^2 = 7609557$, $\sum y^2 = 5.067302$, $\sum xy = 4115.025$
3. A researcher has determined that a significant relationship exists among an employee’s age x1 ,
grade point average x2 , and income y . The multiple regression equation is
y ' 34127 132 x1 20805x2 . Predict the income of a person who is 32 years old and has a
GPA of 3.4.
4. The data shown below is for the car rental companies in Fiji for a recent year.
Company                     A     B     C     D     E     F
Cars (in thousands), x      63    29    20.8  19.1  13.4  8.5
Revenue (in millions), y    7.0   3.9   2.1   2.8   1.4   1.5
Using the 5% level of significance and r = 0.982, test whether the coefficient of correlation is
significant.
Objectives
After completing this chapter, you should be able to:
1. Test the distribution for goodness of fit, using chi-square.
2. Test two variables for independence, using chi-square.
13.1 Introduction
This chapter describes the hypothesis testing of categorical data based on the chi-square distribution. The
distribution can be used for tests concerning frequency distributions, such as whether the observed
frequencies of an experiment follow a certain pattern or theoretical distribution. This test is called the
chi-square test for goodness-of-fit. The chi-square distribution can also be used to test the independence of two
attributes. For example, we can test whether two attributes 'smoking' and 'cancer' are independent.
Unlike the t-distribution, which is symmetric about the mean 0 for any degrees of freedom, the chi-square
random variable $\chi^2$ takes nonnegative values only and its distribution is always skewed to the right. The
general shape of chi-square distributions is shown below. It can be seen that the skewness diminishes
as the degrees of freedom ($\nu$) increase.
The value of $\chi^2$ which leaves an area $\alpha$ (with $\nu$ d.f.) to its right is represented by $\chi^2_\alpha$.
EXAMPLE 13−1
Find the value of $\chi^2_\alpha$ for 5 d.f. and an area of 0.025 in the right tail of the chi-square distribution.
SOLUTION
To find the value of $\chi^2_\alpha$, look for $\nu = 5$ and $\alpha = p = 0.025$ in the table. Therefore, for d.f. = 5, the value
of $\chi^2_{0.025} = 12.833$.
Test statistic
$$ \chi^2 = \sum \frac{(O - E)^2}{E}, $$
where
O = observed frequency for a category
E = expected frequency for a category = np
Note:
1. A chi–square goodness–of–fit test is always a right–tailed test.
2. If the expected frequency of a class is too small (<5), combine it with the expected frequency of an
adjacent class.
EXAMPLE 13−2
The number of automobile accidents per week in a city is as follows: 12, 8, 20, 2, 14, 10, 15, 6, 9, 4. Are
these frequencies in agreement with the belief that accident conditions were the same during this 10-week
period?
SOLUTION
Hypothesis:
Critical Value:
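From the Eton Tables, the critical value using d.f. = 9 and $\alpha = 0.05$ is $\chi^2_{0.05} = 16.919$.
Test value:
The total number of accidents is 100, so the expected frequency for each of the 10 weeks is $E = 100/10 = 10$, and
$$ \chi^2 = \sum \frac{(O - E)^2}{E} = 26.6. $$
Conclusion:
Since the test value lies in the critical region, $H_0$ is rejected at the 5% level of significance. Hence, we may
conclude that the accident conditions were not the same from week to week.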
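The goodness-of-fit calculation in this example can be reproduced with a few lines of Python. This is a sketch under the example's assumption that every week has the same expected frequency; the function name `chi_square_gof` is an illustrative choice, and the critical value 16.919 is taken from the table lookup above.

```python
def chi_square_gof(observed, expected):
    """Chi-square goodness-of-fit statistic: sum of (O - E)^2 / E."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

accidents = [12, 8, 20, 2, 14, 10, 15, 6, 9, 4]

# Under H0 the accidents are spread evenly over the 10 weeks.
expected = [sum(accidents) / len(accidents)] * len(accidents)

chi2 = chi_square_gof(accidents, expected)
critical = 16.919                 # d.f. = 9, alpha = 0.05 (from the table)
reject_h0 = chi2 > critical       # the test is always right-tailed

print(round(chi2, 1), reject_h0)  # 26.6 True
```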
EXAMPLE 13−3
The theory predicts the proportion of beans in the four groups A, B, C and D should be 9:3:3:1. In an
experiment with 1600 beans the numbers in the four groups were 882, 313, 287 and 118. Does the
experimental result support the theory? Use 5% level of significance.
SOLUTION
Hypothesis:
H 0 : There is no difference between experimental and theoretical results (i.e. experimental result
supports the theory that the proportions of four types of bean are 9:3:3:1).
H1 : The experimental result does not support the theory.
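Critical Value:
Degrees of freedom: $4 - 1 = 3$. The critical value for 3 d.f. at the 5% level of significance is $\chi^2_{0.05} = 7.815$.
Test value:
Group    O      E                        (O − E)²    (O − E)²/E
A        882    (9/16) × 1600 = 900      324         0.360
B        313    (3/16) × 1600 = 300      169         0.563
C        287    (3/16) × 1600 = 300      169         0.563
D        118    (1/16) × 1600 = 100      324         3.240
Total                                                4.726
The test value is
$$ \chi^2 = \sum \frac{(O - E)^2}{E} = 4.726. $$
Conclusion:
Since the test value falls in the acceptance region, we do not reject $H_0$. Hence, we may conclude that
the experimental results support the theory.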
A test of independence involves a test of the null hypothesis that two attributes of a population are
independent, that is,
H 0 : The attributes are independent (i.e. there is no association or relation between the attributes)
H1 : The attributes are not independent (i.e. there is association or relation between the attributes)
Test Statistic
$$ \chi^2 = \sum \frac{(O - E)^2}{E}, $$
where
$O$ and $E$ are the observed and expected frequencies, respectively, for a cell.
Degrees of Freedom
In testing independence of two attributes, the information is presented in a contingency table where one
attribute is arranged in rows and the other attribute is arranged in columns. The degrees of freedom are
$$ \text{d.f.} = (r - 1)(c - 1), $$
where
$r$ and $c$ are the number of rows and the number of columns, respectively, in the given
contingency table.
Expected Frequencies
The expected frequency for a cell is
$$ E = \frac{(\text{row total})(\text{column total})}{\text{grand total}}. $$
EXAMPLE 13−4
In an experiment on the immunization of cattle against tuberculosis, the following results were obtained:
Vaccination        Tuberculosis
                   Affected    Unaffected
Inoculated         12          28
Not inoculated     13          7
Examine whether the vaccine is effective in controlling the disease at the 5% level of significance.
SOLUTION
Hypothesis:
H 0 : There is no relation between the vaccination and the tuberculosis (i.e. the vaccine is not effective in
controlling the disease).
Critical Value:
Degrees of freedom: $(r - 1)(c - 1) = (2 - 1)(2 - 1) = 1$. The critical value for 1 d.f. at the 5% level of
significance is $\chi^2_{0.05} = 3.841$.
Test value:
O     E                          (O − E)²    (O − E)²/E
12    (40 × 25)/60 = 16.667      21.781      1.307
28    (40 × 35)/60 = 23.333      21.781      0.933
13    (20 × 25)/60 = 8.333       21.781      2.614
7     (20 × 35)/60 = 11.667      21.781      1.867
Total                                        6.721
The test value is
$$ \chi^2 = \sum \frac{(O - E)^2}{E} = 6.721. $$
Conclusion: Since the test value falls in the critical region, $H_0$ is rejected at the 5% level of significance.
Hence, we may conclude that the vaccine is effective in controlling the disease.
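The expected frequencies and the test value for a contingency table follow a fixed recipe, which the Python sketch below implements for the cattle data. The function names are illustrative choices. The script works with unrounded expected frequencies, so its result (about 6.72) can differ in the third decimal place from a hand calculation based on rounded values.

```python
def expected_frequencies(table):
    """Expected cell counts: (row total * column total) / grand total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]

def chi_square_independence(table):
    """Chi-square statistic for a test of independence."""
    expected = expected_frequencies(table)
    return sum(
        (o - e) ** 2 / e
        for observed_row, expected_row in zip(table, expected)
        for o, e in zip(observed_row, expected_row)
    )

# Rows: inoculated, not inoculated. Columns: affected, unaffected.
cattle = [[12, 28], [13, 7]]

chi2 = chi_square_independence(cattle)
df = (len(cattle) - 1) * (len(cattle[0]) - 1)   # (r - 1)(c - 1)

print(round(chi2, 2), df)  # 6.72 1
```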
EXAMPLE 13−5
To study the effect of soil condition on the growth of a new hybrid plant, saplings were planted on three
types of soil and their subsequent growth was classified into three categories.
Test the hypothesis that there is an association between growth of plant and soil type. Use the 1% level of
significance.
SOLUTION
Hypothesis:
H0 : There is no association between growth of plant and soil type.
H1 : There is an association between growth of plant and soil type.
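Critical Value:
Degrees of freedom: $(r - 1)(c - 1) = (3 - 1)(3 - 1) = 4$. The critical value for 4 d.f. at the 1% level of
significance is $\chi^2_{0.01} = 13.277$.
Test value:
O     E                           (O − E)²    (O − E)²/E
16    (38 × 65)/185 = 13.351      7.017       0.526
8     (38 × 60)/185 = 12.324      18.697      1.517
14    (38 × 60)/185 = 12.324      2.809       0.228
31    (68 × 65)/185 = 23.892      50.524      2.115
16    (68 × 60)/185 = 22.054      36.651      1.662
21    (68 × 60)/185 = 22.054      1.111       0.050
18    (79 × 65)/185 = 27.757      95.199      3.430
36    (79 × 60)/185 = 25.622      107.703     4.204
25    (79 × 60)/185 = 25.622      0.387       0.015
Total                                         13.747
The test value is
$$ \chi^2 = \sum \frac{(O - E)^2}{E} = 13.747. $$
Conclusion: Since the test value falls in the critical region, $H_0$ is rejected at the 1% level of significance.
Hence, we may conclude that there is an association between growth of plant and soil type.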
EXERCISES
1. A Westpac Bank branch in Kiribati has an ATM installed inside the bank, and it is available to its customers
only from 8am to 3pm. The manager wanted to investigate whether the number of transactions made is the
same for each of the five days (Monday through Friday). She randomly selected one week and
counted the number of transactions made on each of the 5 days. The information she obtained is in
the table below.
Using the 2.5% significance level, test the null hypothesis that the number of transactions made on each
of the 5 days is the same. Assume that this week is typical of all weeks with regard to the use of this
ATM.
2. A random sample of 300 adults was selected and they were asked if they favor school teachers
punishing students for violence and lack of discipline. Does the sample provide sufficient information
to conclude that the two attributes, gender and opinions of adults, are dependent? Use a 1%
significance level.
Gender       Opinions
             In Favor (F)    Against (A)    No Opinion (N)    Total
Men (M)      93              70             12                175
Women (W)    87              32             6                 125
Total        180             102            18                300
ANALYSIS OF VARIANCE
Objectives
After completing this chapter, you should be able to:
1. Use the one-way ANOVA technique to determine if there is a significant difference among three
or more means.
2. Use the two-way ANOVA technique to determine if there is a significant difference in the main
effects or interaction.
14.1 Introduction
We have studied how to compare two population means in Chapter 9. In this chapter, we develop a
method for comparing more than two population means. This method is called analysis of variance
(ANOVA). For example, a marketing specialist wishes to see whether there is a difference in the average
time a customer has to wait in a checkout line in three large self-service department stores. The specialist
will use the ANOVA technique, which is an F test, to compare three or more means.
The analysis of variance that is used to compare three or more means is called a one-way analysis of
variance or one-factor design or completely randomized design since it contains only one variable
or one factor. In the previous example, the variable is the department store. The ANOVA can be
extended to studies involving two variables; such studies are called two-way analysis of variance.
F-distribution curves
For a test of the difference among three or more means, the following hypotheses should be used:
$H_0: \mu_1 = \mu_2 = \cdots = \mu_k$
$H_1$: At least one mean is different from the others.
Although means are being compared in this F test, variances are used in the test instead of the means.
With the F test, two different estimates of the population variance are made. The first estimate is called
the between-group variance, and it involves computing the variance by using the means of the groups.
The formula for computing the between-group variance is given by:
$$ s_B^2 = \frac{\sum n_i\left(\bar{X}_i - \bar{X}_{GM}\right)^2}{k - 1}, $$
where
$n_i$ and $\bar{X}_i$ are the sample size and the sample mean of the $i$th group, $\bar{X}_{GM}$ is the grand mean of all the
data values, and $k$ is the number of groups.
The second estimate is called the within-group variance; it involves computing the variance by using all the
data and is not affected by differences in the means. The formula for computing the within-group variance
is given by:
$$ s_W^2 = \frac{\sum (n_i - 1)s_i^2}{N - k}, $$
where
$s_i^2$ is the sample variance for the $i$th group.
The test statistic is
$$ F = \frac{s_B^2}{s_W^2}, $$
where
d.f.N. = k – 1, where k is the number of groups.
d.f.D. = N – k, where N is the sum of the sample sizes of the groups.
EXAMPLE 14−1
A marketing specialist wishes to see whether there is a difference in the average time a customer has to
wait in a checkout line in three large self-service department stores. The times (in minutes) are shown
below. Is there a significant difference in the mean waiting times of customers for each store
using $\alpha = 0.05$?
SOLUTION
$H_0: \mu_1 = \mu_2 = \mu_3$
$H_1$: At least one mean is different from the others. (claim)
d.f.N. $= k - 1 = 2$
d.f.D. $= N - k = 15$
The critical value is 3.6823, obtained from the F-distribution table with $\alpha = 0.05$.
[Figure: F-distribution curve with the rejection region to the right of 3.6823.]
The grand mean is
$$ \bar{X}_{GM} = \frac{\sum X}{N} = \frac{75}{18} = 4.17. $$
Between-group variance:
$$ s_B^2 = \frac{\sum n_i\left(\bar{X}_i - \bar{X}_{GM}\right)^2}{k - 1} = 12.5 $$
Within-group variance:
$$ s_W^2 = \frac{\sum (n_i - 1)s_i^2}{N - k} = 4.6367 $$
Therefore,
$$ F = \frac{s_B^2}{s_W^2} = \frac{12.5}{4.6367} = 2.7. $$
Step 4: Since the test value F = 2.7 lies in the non-rejection region, we do not reject the null hypothesis
and conclude that there is no significant difference in the mean waiting times of customers for each store.
The numerator of the between-group variance is called the sum of squares between groups, denoted by
$SS_B$, and the numerator of the within-group variance is called the sum of squares within groups or sum of
squares for the error, denoted by $SS_W$. Therefore,
$$ s_B^2 = \frac{SS_B}{k - 1} \quad\text{and}\quad s_W^2 = \frac{SS_W}{N - k}. $$
These two variances are sometimes called mean squares, denoted as $MS_B$ and $MS_W$. Therefore,
$$ F = \frac{MS_B}{MS_W}. $$
These terms are used to summarize the analysis of variance in a table given below:
Note: Most computer programs provide ANOVA summary table as the output.
EXAMPLE 14−2
A researcher wishes to see whether there is any difference in the weight gains of athletes following one
of the three special diets. Athletes are randomly assigned to 3 groups and placed on the diet for 6 weeks.
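The weight gains (in pounds) are shown here. At $\alpha = 0.05$, can the researcher conclude that there is a
difference in the diets?
SOLUTION
$H_0: \mu_1 = \mu_2 = \mu_3$
$H_1$: At least one mean is different from the others. (claim)
d.f.N. $= k - 1 = 2$
d.f.D. $= N - k = 11$
The critical value is 3.9823, obtained from the F-distribution table with $\alpha = 0.05$.
[Figure: F-distribution curve with the rejection region to the right of 3.9823.]
The grand mean is
$$ \bar{X}_{GM} = \frac{\sum X}{N} = \frac{99}{14} = 7.07. $$
Between-group variance:
$$ s_B^2 = \frac{\sum n_i\left(\bar{X}_i - \bar{X}_{GM}\right)^2}{k - 1} = 50.61 $$
Within-group variance:
$$ s_W^2 = \frac{\sum (n_i - 1)s_i^2}{N - k} = 6.53 $$
Therefore,
$$ F = \frac{s_B^2}{s_W^2} = \frac{50.61}{6.53} = 7.75. $$
Step 4: Since the test value F = 7.75 lies in the rejection region, we reject the null hypothesis and conclude
that there is a significant difference in the diets.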
EXAMPLE 14−3
A research organization tested microwave ovens. At $\alpha = 0.05$, is there a significant difference in the
average prices of the three types of oven?
Watts
1000    900    800
270     240    180
245     135    155
190     160    200
215     230    120
250     250    140
230     200    180
        200    140
        210    130
A computer printout for this exercise is shown below. Use a P-value method and the information in the
printout to test the claim.
ANOVA table
Source SS df MS F p-value
Treatment 21,729.73 2 10,864.867 10.12 .0010
Error 20,402.08 19 1,073.794
Total 42,131.82 21
SOLUTION
$H_0: \mu_1 = \mu_2 = \mu_3$
$H_1$: At least one mean is different from the others.
Step 2: Find the test value. From the ANOVA table, the test value is F = 10.12.
Step 4: Since the P-value (0.0010) is less than $\alpha = 0.05$, we reject the null hypothesis and conclude that
there is a significant difference in the average prices of the three types of oven.
EXAMPLE 14−4
A set of data involving 4 different types of food (A, B, C and D) tried on 20 chicks is given below. All 20
chicks are treated alike in all respects except the feeding treatments, and each feeding treatment is given
to 5 randomly selected chicks. Perform an analysis of variance and test the hypothesis that the mean
weight gain is the same for all 4 foods.
The weight gain (in gm) of chicks due to the foods was recorded as:
Arranging the data and computing the sample size, mean and variance of each group we have:
Food
Food A              Food B              Food C              Food D
55                  61                  42                  169
49                  112                 97                  137
42                  30                  81                  169
21                  89                  95                  85
52                  63                  92                  154
$n_1 = 5$           $n_2 = 5$           $n_3 = 5$           $n_4 = 5$
$\bar{X}_1 = 43.8$  $\bar{X}_2 = 71$    $\bar{X}_3 = 81.4$  $\bar{X}_4 = 142.8$
$s_1^2 = 185.7$     $s_2^2 = 962.5$     $s_3^2 = 523.3$     $s_4^2 = 1218.2$
d.f.N. $= k - 1 = 3$
d.f.D. $= N - k = 16$
The critical value is 3.2389, obtained from the F-distribution table with $\alpha = 0.05$.
[Figure: F-distribution curve with the rejection region to the right of 3.2389.]
The sample size, mean and variance of each group are given in the table above. The grand mean is
$$ \bar{X}_{GM} = \frac{\sum X}{N} = \frac{1695}{20} = 84.75. $$
Between-group variance:
$$ s_B^2 = \frac{\sum n_i\left(\bar{X}_i - \bar{X}_{GM}\right)^2}{k - 1} = \frac{5(43.8 - 84.75)^2 + 5(71 - 84.75)^2 + 5(81.4 - 84.75)^2 + 5(142.8 - 84.75)^2}{4 - 1} = 8744.98 $$
Within-group variance:
$$ s_W^2 = \frac{\sum (n_i - 1)s_i^2}{N - k} = \frac{4(185.7) + 4(962.5) + 4(523.3) + 4(1218.2)}{20 - 4} = 722.43 $$
Therefore,
$$ F = \frac{s_B^2}{s_W^2} = \frac{8744.98}{722.43} = 12.105. $$
Step 4: Since the test value F = 12.105 lies in the rejection region, we reject the null hypothesis and
conclude that the mean weight gain is not the same for all 4 foods.
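The one-way ANOVA computation for the chick data can be sketched in Python using only the standard library. The function name `one_way_anova` is an illustrative choice; `statistics.variance` returns the sample variance (divisor n − 1), which is the $s_i^2$ used in the within-group formula.

```python
from statistics import mean, variance

def one_way_anova(groups):
    """Return (between-group variance, within-group variance, F)."""
    k = len(groups)                                   # number of groups
    n_total = sum(len(g) for g in groups)             # N, all observations
    grand_mean = sum(sum(g) for g in groups) / n_total
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((len(g) - 1) * variance(g) for g in groups)
    ms_between = ss_between / (k - 1)                 # d.f.N. = k - 1
    ms_within = ss_within / (n_total - k)             # d.f.D. = N - k
    return ms_between, ms_within, ms_between / ms_within

foods = [
    [55, 49, 42, 21, 52],      # Food A
    [61, 112, 30, 89, 63],     # Food B
    [42, 97, 81, 95, 92],      # Food C
    [169, 137, 169, 85, 154],  # Food D
]

ms_b, ms_w, f = one_way_anova(foods)
critical = 3.2389   # F table value for alpha = 0.05, d.f. 3 and 16 (from the text)

print(round(ms_b, 2), round(ms_w, 2), round(f, 3), f > critical)
```

The groups do not need to be the same size; the formulas weight each group by its own sample size.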
Note:
When the null hypothesis is rejected using the F test, we conclude that the means are not all equal, but we
still do not know where the differences exist. Several procedures have been developed to determine where
the significant differences in the means lie after the ANOVA has been performed. Among the most
commonly used tests are the Scheffé test and the Tukey test.
The two-way analysis of variance is quite complicated, and many aspects of the subject should be
considered in the two-way ANOVA. For this purpose, in this chapter only brief introduction to the subject
will be given.
In a two-way ANOVA, the researcher is able to test the effects of two independent variables or factors on
one dependent variable. In addition, the interaction effect of the two variables can also be studied.
For example, suppose a researcher wishes to test the effect of two varieties (say variety A and B) of
potatoes and two different locations (say location 1 and 2) on the yielding capacity of potatoes. The two
factors or independent variables are the varieties of potatoes and the different locations, while the
dependent variable is the yield of potatoes. The factors such as water, temperature, and sunlight are held
constant.
The two-way ANOVA has several hypotheses. For the above example, the hypotheses are as
follows:
Variety of Potatoes:
$H_0$: There is no significant difference in the yielding capabilities of the two varieties.
$H_1$: There is a significant difference in the yielding capabilities of the two varieties.
Different Locations:
H 0 : There are no significant differences between the locations
H1 : There are significant differences between the locations
Interaction Effect:
H 0 : There is no interaction effect between the variety of potato and different location on the yield.
H1 : There is interaction effect between the variety of potato and different location on the yield.
Note:
1. The groups for such a two-way ANOVA are sometimes called treatment groups.
2. This design is called a $2 \times 2$ design, since each variable consists of two levels, that is, two
different treatments.
In the table:
The computational procedure for the two-way ANOVA is quite lengthy. For this reason, the sum of
squares will be provided in a summary ANOVA table and you should be able to interpret the table and
summarize the results.
EXAMPLE 14−5
A researcher wishes to see whether the type of gasoline used and the type of automobile driven have
any effect on gasoline consumption. Two types of gasoline (regular and high-octane) and two types
of automobiles (2-wheel drive and 4-wheel drive) will be used. There will be two
automobiles in each group, so 8 automobiles are used in total. Analyse the data shown below, using two-way
ANOVA with $\alpha = 0.05$.
The data (in miles per gallon) and the summary table are shown here.
               Type of Automobile
Gas            2-wheel drive    4-wheel drive
Regular        26.7             28.6
               25.2             29.3
High-octane    32.3             26.1
               32.8             24.2
SOLUTION
H 0 : There is no difference between the means of gasoline consumption for two types of automobiles.
H1 : There is difference between the means of gasoline consumption for two types of automobiles.
H 0 : There is no interaction effect between type of gasoline used and type of automobile a person drives
on gasoline consumption.
H1 : There is interaction effect between type of gasoline used and type of automobile a person drives
on gasoline consumption.
Step 2: Find the critical values for each F-test. Factor A is the type of gasoline and it has two levels
(regular and high-octane), so $a = 2$. Factor B is the type of automobile driven and it has two levels
(2-wheel and 4-wheel drive), so $b = 2$. The number of data values in each group is 2, so $n = 2$. The
degrees of freedom are as follows:
Gasoline: $a - 1 = 2 - 1 = 1$.
Automobile: $b - 1 = 2 - 1 = 1$.
Interaction: $(a - 1)(b - 1) = (2 - 1)(2 - 1) = 1$.
Error: $ab(n - 1) = 2(2)(2 - 1) = 4$.
Each F-test therefore has d.f.N. = 1 and d.f.D. = 4, and the critical value from the F-distribution table
with $\alpha = 0.05$ is 7.7086.
Source SS d.f MS F
Gasoline 3.920 1 3.920 4.752
Automobile 9.680 1 9.680 11.733
Interaction 54.080 1 54.080 65.552
Within (error) 3.300 4 0.825
Total 70.980 7
The test values are:
Gasoline: F = 4.752,
Automobile: F = 11.733,
Interaction: F = 65.552.
Gasoline:
Since the test value F = 4.752 falls in the acceptance region, we do not reject the null hypothesis and
conclude that there is no difference between the means of gasoline consumption for the two types of
gasoline.
Automobile:
Since the test value F = 11.733 falls in the rejection region, we reject the null hypothesis and
conclude that there is a difference between the means of gasoline consumption for the two types of
automobiles.
Interaction:
Since the test value F = 65.552 falls in the rejection region, we reject the null hypothesis and
conclude that there is an interaction effect between the type of gasoline used and the type of automobile
a person drives on gasoline consumption.
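The sums of squares in the summary table can be reproduced from the raw data. The following is a minimal Python sketch for a balanced two-way design (the same number of observations in every cell). The function name `two_way_anova` and the layout of the `cells` argument are illustrative choices, not a standard library interface.

```python
from statistics import mean

def two_way_anova(cells):
    """cells[i][j] holds the replicates for level i of factor A and level j of factor B.

    Returns a dict mapping each source to (SS, d.f., MS, F); the error row has no F.
    """
    a, b, n = len(cells), len(cells[0]), len(cells[0][0])
    grand = mean(v for row in cells for cell in row for v in cell)
    row_means = [mean(v for cell in row for v in cell) for row in cells]
    col_means = [mean(v for row in cells for v in row[j]) for j in range(b)]
    cell_means = [[mean(cell) for cell in row] for row in cells]

    ss_a = b * n * sum((m - grand) ** 2 for m in row_means)
    ss_b = a * n * sum((m - grand) ** 2 for m in col_means)
    ss_ab = n * sum(
        (cell_means[i][j] - row_means[i] - col_means[j] + grand) ** 2
        for i in range(a) for j in range(b)
    )
    ss_error = sum(
        (v - cell_means[i][j]) ** 2
        for i in range(a) for j in range(b) for v in cells[i][j]
    )

    df_a, df_b = a - 1, b - 1
    df_ab, df_error = df_a * df_b, a * b * (n - 1)
    ms_error = ss_error / df_error
    table = {}
    for name, ss, df in (("A", ss_a, df_a), ("B", ss_b, df_b), ("AB", ss_ab, df_ab)):
        table[name] = (ss, df, ss / df, (ss / df) / ms_error)
    table["error"] = (ss_error, df_error, ms_error)
    return table

# Factor A: gasoline (regular, high-octane). Factor B: automobile (2-wheel, 4-wheel drive).
mileage = [
    [[26.7, 25.2], [28.6, 29.3]],
    [[32.3, 32.8], [26.1, 24.2]],
]

for source, values in two_way_anova(mileage).items():
    print(source, [round(v, 3) for v in values])
```

Each F value is then compared with the critical value for its own degrees of freedom, exactly as in the hand solution.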
EXAMPLE 14−6
A medical researcher wishes to test the effects of two different diets and two different exercise programs
on the glucose level in a person's blood. The glucose level is measured in milligrams per deciliter (mg/dl).
Three subjects are randomly assigned to each group. Analyse the data shown below, using two-way
ANOVA with $\alpha = 0.05$.
The data (in milligrams per deciliter) and the summary table are shown here.
                    Diet
Exercise Program    A             B
I                   62, 64, 66    58, 62, 53
II                  65, 68, 72    83, 85, 91
SOLUTION
Step 1: State the hypothesis.
Step 2: Find the critical values for each F-test. Factor A is the type of exercise and it has two levels (I
and II), so $a = 2$. Factor B is the type of diet and it has two levels (A and B), so $b = 2$. The number of
data values in each group is 3, so $n = 3$. The degrees of freedom are as follows:
Exercise: $a - 1 = 2 - 1 = 1$.
Diet: $b - 1 = 2 - 1 = 1$.
Interaction: $(a - 1)(b - 1) = (2 - 1)(2 - 1) = 1$.
Error: $ab(n - 1) = 2(2)(3 - 1) = 8$.
Each F-test therefore has d.f.N. = 1 and d.f.D. = 8, and the critical value from the F-distribution table
with $\alpha = 0.05$ is 5.3177.
Source SS d.f MS F
Exercise 816.75 1 816.75 60.5
Diet 102.083 1 102.083 7.56
Interaction 444.083 1 444.083 32.9
Within (error) 108 8 13.5
Total 1470.916 11
Exercise: F 60.5,
Diet: F 7.56,
Interaction: F 32.9,
Exercise:
Since the test value $F = 60.5$ falls in the rejection region, we reject the null hypothesis and
conclude that there is a difference in the mean glucose levels of the persons in the two exercise
programs.
Diet:
Since the test value $F = 7.56$ falls in the rejection region, we reject the null hypothesis and
conclude that there is a difference in the mean glucose levels of the persons in the two diet
programs.
Interaction:
Since the test value $F = 32.9$ falls in the rejection region, we reject the null hypothesis and
conclude that there is an interaction effect between the type of exercise program and the type of diet on a
person’s glucose level.
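The summary table of Example 14−6 can be reproduced from the raw data. The following is a minimal Python sketch, assuming equal replication in every cell; the variable names are illustrative and not part of the original text.

```python
# Two-way ANOVA with equal replication, using the glucose data of Example 14-6.
# data[exercise][diet] holds the three observations in that cell.
data = {
    "I":  {"A": [62, 64, 66], "B": [58, 62, 53]},
    "II": {"A": [65, 68, 72], "B": [83, 85, 91]},
}

def mean(values):
    return sum(values) / len(values)

rows = list(data)                 # levels of factor A (exercise)
cols = list(data[rows[0]])        # levels of factor B (diet)
a, b = len(rows), len(cols)
n = len(data[rows[0]][cols[0]])   # observations per cell

grand = mean([x for r in rows for c in cols for x in data[r][c]])
row_means = {r: mean([x for c in cols for x in data[r][c]]) for r in rows}
col_means = {c: mean([x for r in rows for x in data[r][c]]) for c in cols}
cell_means = {(r, c): mean(data[r][c]) for r in rows for c in cols}

# Sums of squares for each source of variation.
ss_a = b * n * sum((m - grand) ** 2 for m in row_means.values())
ss_b = a * n * sum((m - grand) ** 2 for m in col_means.values())
ss_cells = n * sum((m - grand) ** 2 for m in cell_means.values())
ss_ab = ss_cells - ss_a - ss_b
ss_w = sum((x - cell_means[r, c]) ** 2
           for r in rows for c in cols for x in data[r][c])

# Mean squares and F test values (each effect divided by the error mean square).
ms_w = ss_w / (a * b * (n - 1))
f_a = (ss_a / (a - 1)) / ms_w
f_b = (ss_b / (b - 1)) / ms_w
f_ab = (ss_ab / ((a - 1) * (b - 1))) / ms_w

print(round(f_a, 1), round(f_b, 2), round(f_ab, 1))
```

The critical values still have to be read from the F-distribution table, as in Step 2.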
14.5 Summary
This chapter explains the concepts of analysis of variance (ANOVA). The concepts discussed in this
chapter are the F-distribution, one-way analysis of variance and two-way analysis of variance.
1. The amount of sodium (in milligrams) in one serving for a random sample of three different kinds of
foods is listed below. At the 0.05 level of significance, is there sufficient evidence to conclude that a
difference in mean sodium amounts exists among condiments, cereals, and desserts?
2. How does the two-way ANOVA differ from the one-way ANOVA?
3. A contractor wishes to see whether there is a difference in the time (in days) it takes two
subcontractors to build three different types of homes. At $\alpha = 0.05$, analyse the data using the
two-way ANOVA table provided below.
Source            SS          d.f.   MS   F
Subcontractor     1672.553
Home type         444.867
Interaction       313.267
Within (error)    328.800
Total             2759.487
APPENDIX A:
ANSWERS TO EXERCISES
1.
A. Descriptive
B. Inferential
C. Descriptive
D. Inferential
2.
A. Number of tutorial sessions a student missed
B. Statistic: the mean number of missed classes for the 35 students is 2 days; parameter: the average
number of tutorial sessions a student missed in 2016 with the previous year’s average of 3 classes.
3.
A. Interval
B. Ordinal
C. Nominal
D. Ratio
E. Ratio
4.
A. Systematic
B. Stratified
C. Cluster
D. Random
E. Systematic
5. True
6. The confounding variable influences the dependent variable, but cannot be separated from the
independent variable.
7.
A. Discrete
B. Continuous
C. Discrete
D. Continuous
8.
A. Quantitative
B. Qualitative
C. Qualitative
D. Quantitative
E. Quantitative
9.
A. Observational
B. Observational
C. Experimental
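1. A. & B. [Frequency distribution table of blood types; $\sum f = 25$]
C. [Pie graph of blood types; surviving slice labels: 56%, 28%, 16%]
D. [Bar graph: frequency versus blood type (A, B, AB, O)]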
E.
C.
[Histogram: frequency versus amount of protein (g), class boundaries 11.5, 18.5, 25.5, 32.5, 39.5, 46.5]
[Frequency polygon: frequency versus class midpoints 8, 15, 22, 29, 36, 43, 50]
[Ogive: cumulative frequency versus class boundaries 11.5, 18.5, 25.5, 32.5, 39.5, 46.5]
3.
A.
Stem | Leaf
4    | 5 8
5    | 2 4 5 8 8 9
6    | 1 1 2 4 5 6 6 7
7    | 0 3 5 7 7 8 9
8    | 0 2 3 6 6
9    | 1 5
B. 16/30 = 8/15
C. In the 60s
D. Approximately symmetric with the peak in the 60s.
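$\bar{X} = \frac{\sum X}{n} = \frac{13857}{12} = 1154.75 \approx 1154.8$
Median $= \frac{947 + 956}{2} = 951.5$
Mode $= 856$
$X$        $(X - \bar{X})^2$
2215       1124130.06
1888       537655.56
1477       103845.06
1059       9168.06
977        31595.06
956        39501.56
947        43160.06
924        53245.56
899        65408.06
856        89251.56
856        89251.56
803        123728.06
$\sum X = 13857$    $\sum (X - \bar{X})^2 = 2309940.25$
$s^2 = \frac{\sum (X - \bar{X})^2}{n - 1} = \frac{2309940.25}{11} = 209994.57$, $s = \sqrt{209994.57} = 458.25$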
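These measures can be checked with Python's standard `statistics` module. This is only a sketch: the data list is copied from the deviation table above, and `statistics.variance` uses the same divisor $n - 1$ as the formula for $s^2$.

```python
import statistics

# Data values from the deviation table above.
data = [2215, 1888, 1477, 1059, 977, 956, 947, 924, 899, 856, 856, 803]

mean = statistics.mean(data)          # sum(X) / n
median = statistics.median(data)      # average of the two middle ranked values
mode = statistics.mode(data)          # most frequent value
variance = statistics.variance(data)  # sample variance, divisor n - 1
std_dev = statistics.stdev(data)      # square root of the sample variance

print(mean, median, mode, round(variance, 2), round(std_dev, 2))
```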
B.
$Q_1$ or $P_{25}$ is obtained by
$P_{25} = \frac{25(12)}{100}\text{th term} = 3\text{rd term}$
The value of the 3rd term can be approximated by the average of the 3rd and 4th terms in the ranked
data. Therefore,
$Q_1 = \frac{856 + 899}{2} = 877.5$
The value of the 4.8th term can be approximated by the 5th term in the ranked data.
Therefore,
$P_{40} = 924$
Percentile rank of 956 $= \frac{6 + 0.5}{12} \times 100\% = 54.2\%$.
C.
Step 3: Check the data set for any data values that fall outside the interval from 291.75 to
1853.75. Since the data values 1888 and 2215 are outside this interval, they can be considered
outliers.
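The outlier check can be sketched in Python as follows. The helper `percentile` is a hypothetical name; it follows the position rule used in part B (compute $c = np/100$, average the $c$th and $(c+1)$th ranked values when $c$ is whole, otherwise round up). The fences $Q_1 - 1.5(\text{IQR})$ and $Q_3 + 1.5(\text{IQR})$ are an assumption here, chosen because they reproduce the interval from 291.75 to 1853.75 quoted in Step 3.

```python
import math

data = [2215, 1888, 1477, 1059, 977, 956, 947, 924, 899, 856, 856, 803]

def percentile(values, p):
    """Value at the p-th percentile using the position rule c = n*p/100."""
    ranked = sorted(values)
    c = len(ranked) * p / 100
    if c == int(c):                      # whole number: average c-th and (c+1)-th
        c = int(c)
        return (ranked[c - 1] + ranked[c]) / 2
    return ranked[math.ceil(c) - 1]      # otherwise round up to the next position

q1 = percentile(data, 25)
q3 = percentile(data, 75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [x for x in data if x < lower or x > upper]

print(q1, q3, lower, upper, outliers)
```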
D.
The Five-Number Summary (Note: The data should be arranged in ascending order first)
1. The lowest value is 803;
2. $Q_1 = 877.5$;
3. The median is 951.5;
4. $Q_3 = 1268$;
5. The highest value is 2215;
Since the median is to the left of the center of the box or the right line is larger than the left line, the
distribution is positively skewed.
2.
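Number of Employees    Frequency ($f$)    Midpoints ($X_m$)    $f X_m$    $f X_m^2$
A.
Mean $= \frac{\sum f X_m}{N} = \frac{2305}{110} = 20.95$
[Ogive: cumulative percentage versus number of people, class boundaries 0.5, 10.5, 20.5, 30.5, 40.5, 50.5]
B.
Variance $= \frac{\sum f X_m^2 - \frac{(\sum f X_m)^2}{N}}{N} = \frac{70627.5 - \frac{(2305)^2}{110}}{110} = 202.98$
Standard deviation $= \sqrt{202.98} = 14.25$
Hence,
Weighted mean $= \frac{1(73) + 1(67) + 2(85)}{1 + 1 + 2} = 77.5$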
4.
A. Firm B has the larger wage bill: 200 × $185 = $37,000, whereas Firm A has a wage bill of
100 × $196 = $19,600.
1.
2.
A. Classical, Empirical and Subjective.
B. Empirical, an experiment is performed.
3.
A. Simple event, since a simple event is an event with only one sample point.
B. A compound, since a compound event is an event with more than one sample point.
4.
A. C and S are non-mutually exclusive events since $P(C \cap S) \neq 0$.
B.
$P(C \cup S) = 0.85$, $P(S) = 0.61$, $P(C) = 0.31$,
$P(C \cap S) = P(C) + P(S) - P(C \cup S) = 0.31 + 0.61 - 0.85 = 0.07$
C.
5. Let the events, D = card is Diamond, Q = card is Queen, A= card is 3 and B= card is 6
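A. $P(Q) = \frac{4}{52}$
B. $P(A \cap D) = \frac{1}{52}$
C. $P(A \cup D) = P(A) + P(D) - P(A \cap D) = \frac{4}{52} + \frac{13}{52} - \frac{1}{52} = \frac{16}{52} = \frac{4}{13}$
D. Since $P(A \cap B) = 0$, $P(A \cup B) = P(A) + P(B) = \frac{4}{52} + \frac{4}{52} = \frac{8}{52} = \frac{2}{13}$
6.
A. $P(F) = \frac{10}{13}$
B. $P(N \cap F) = \frac{7}{13}$
C. $P(N \cup F) = P(N) + P(F) - P(N \cap F) = \frac{8}{13} + \frac{10}{13} - \frac{7}{13} = \frac{11}{13}$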
7.
A. $\frac{13}{50}$
B. Since each die can land in six different ways, and two dice are rolled, the sample space can be
presented by a rectangular array as follows:
[Table: sample space for two dice, Die 1 against Die 2, each taking the values 1 to 6]
C. The empirical probability value is quite different from the theoretical or classical probability value
because the number of trials in the experiment used to determine the empirical probability is
small. As the number of trials increases, the empirical probability value tends to get
closer to the theoretical probability value.
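1. Let P = guinea pig is pregnant and P' = guinea pig is not pregnant. Note that picking
the first pig affects the second and the third picks as well; hence, the events are dependent.
A. $P(P \cap P \cap P) = \frac{5}{8} \cdot \frac{4}{7} \cdot \frac{3}{6} = \frac{5}{28} = 0.179$
B. $P(P \cap P \cap P') = 3 \cdot \frac{5}{8} \cdot \frac{4}{7} \cdot \frac{3}{6} = \frac{15}{28} = 0.536$
C. $P(\text{at least one pig is pregnant}) = 1 - P(\text{none are pregnant}) = 1 - P(P' \cap P' \cap P')$
$= 1 - \frac{3}{8} \cdot \frac{2}{7} \cdot \frac{1}{6} = \frac{55}{56} = 0.982$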
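The three answers can be verified with exact fractions. This is a sketch only; the counts (5 pregnant pigs out of 8, three picks without replacement) are read off the fractions in the answer above rather than from the exercise statement.

```python
from fractions import Fraction as F

# A. all three selected pigs are pregnant (5 of 8, then 4 of 7, then 3 of 6)
p_all = F(5, 8) * F(4, 7) * F(3, 6)

# B. exactly two pregnant: the last factor is 3 non-pregnant pigs among the
#    6 remaining, and the non-pregnant pig can occupy any of 3 positions
p_two = 3 * F(5, 8) * F(4, 7) * F(3, 6)

# C. at least one pregnant = 1 - P(none pregnant)
p_at_least_one = 1 - F(3, 8) * F(2, 7) * F(1, 6)

print(p_all, p_two, p_at_least_one)
```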
2. Let C = student own car and C' = student does not own car.
A. $P(C \cap C \cap C) = 0.1 \times 0.1 \times 0.1 = 0.001$
C. $P(\text{at least one student owns a car}) = 1 - P(\text{none own cars}) = 1 - P(C' \cap C' \cap C')$
3. Let W = student works; W' = student does not work; M = student is male and F = student is female;
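A.
i. $P(W) = \frac{250}{400} = 0.625$
ii. $P(W \cup M) = P(W) + P(M) - P(W \cap M) = \frac{250}{400} + \frac{180}{400} - \frac{120}{400} = \frac{310}{400} = 0.775$
iii. $P(F \cap W') = \frac{90}{400} = 0.225$
iv. $P(W' \mid M) = \frac{P(W' \cap M)}{P(M)} = \frac{60}{180} = 0.333$
B. Not mutually exclusive events since $P(M \cap W') = \frac{60}{400} \neq 0$.
C. Dependent events since $P(F \cap W') \neq P(F) \cdot P(W')$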
B. 5! = 120.
C. 4! = 24.
8.
A. $\binom{9}{4}\binom{9}{3}\binom{9}{2} = 381024$.
B. $\binom{9}{4}\binom{5}{3}\binom{2}{2} = 1260$.
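The two products can be checked with `math.comb` from the Python standard library (available from Python 3.8). The binomial coefficients below are read from the stacked numbers in the answer; how they relate to the wording of the exercise is not shown here.

```python
import math

# 8A: product of three binomial coefficients, each chosen from 9
answer_8a = math.comb(9, 4) * math.comb(9, 3) * math.comb(9, 2)

# 8B: choose 4 of 9, then 3 of the remaining 5, then 2 of the remaining 2
answer_8b = math.comb(9, 4) * math.comb(5, 3) * math.comb(2, 2)

print(answer_8a, answer_8b)
```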
9.
${}^{9}C_{5} \cdot {}^{6}C_{0} = 126$
${}^{9}C_{2} \cdot {}^{6}C_{3} = 720$
10 10
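1. $c + 4c + 9c + 16c = 1 \Rightarrow 30c = 1 \Rightarrow c = \frac{1}{30}$
2.
A. $P(X = 2) = 0.35$
B. $P(X \geq 2) = P(X = 2) + P(X = 3) = 0.35 + 0.3 = 0.65$
C. $P(X \leq 1) = P(X = 1) + P(X = 0) = 0.2 + 0.15 = 0.35$
$\sigma^2 = \sum X^2 \cdot P(X) - \mu^2 = [0^2(0.15) + 1^2(0.2) + 2^2(0.35) + 3^2(0.3)] - 1.8^2 = 4.3 - 1.8^2 = 1.06$
$\sigma = \sqrt{1.06} = 1.03$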
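The mean, variance and standard deviation of this probability distribution can be checked with a few lines of Python. This is a sketch; the dictionary simply restates the values of $X$ and $P(X)$ that appear in the variance calculation above.

```python
# Probability distribution used in the variance calculation above.
dist = {0: 0.15, 1: 0.20, 2: 0.35, 3: 0.30}

mean = sum(x * p for x, p in dist.items())                   # mu = sum of X * P(X)
variance = sum(x**2 * p for x, p in dist.items()) - mean**2  # sum of X^2 * P(X) - mu^2
std_dev = variance ** 0.5

print(round(mean, 2), round(variance, 2), round(std_dev, 2))
```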
4.
A.
X 0 1 2 3
P( X ) 0.008 0.096 0.384 0.512
5.
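Joe’s gain $X$ (in dollars)    4       −1
$P(X)$                        6/36    30/36
$E(X) = 4\left(\frac{6}{36}\right) + (-1)\left(\frac{30}{36}\right) = -\frac{1}{6} \approx -0.17$ dollars
The expected loss in playing 15 games is $15 \times \frac{1}{6} = 2.50$ dollars.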
7. Let $X$ be the number of customers having purchased shoes. In this case: $n = 20$, $p = 0.3$, and $q = 0.7$
and $P(X = 0) = {}^{20}C_{0}(0.3)^0(0.7)^{20} = 0.0008$; $P(X = 1) = {}^{20}C_{1}(0.3)^1(0.7)^{19} = 0.0068$;
Therefore,
$P(X \geq 2) = 1 - [P(X = 0) + P(X = 1)] = 1 - (0.0008 + 0.0068) = 0.9924$
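The binomial probabilities can be reproduced directly from the formula ${}^{n}C_{X}\,p^X q^{n-X}$. A minimal sketch follows; `binomial_pmf` is a hypothetical helper name, not a library function.

```python
import math

def binomial_pmf(k, n, p):
    """P(X = k) for a binomial distribution with n trials and success probability p."""
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

n, p = 20, 0.3
p0 = binomial_pmf(0, n, p)
p1 = binomial_pmf(1, n, p)
p_at_least_2 = 1 - (p0 + p1)      # complement of "fewer than two"

print(round(p0, 4), round(p1, 4), round(p_at_least_2, 4))
```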
8. Let $X$ be the number of articles submitted for publication. In this case: $n = 8$, $p = 0.11$, and $q = 0.89$.
A. $P(X = 4) = {}^{8}C_{4}(0.11)^4(0.89)^4 = 0.00643$
9. Here $n = 400$, $p = 0.03$, and $q = 0.97$ and using the formulas, we have
$\mu = n \cdot p = 400(0.03) = 12$
$\sigma^2 = n \cdot p \cdot q = 400(0.03)(0.97) = 11.64$, $\sigma = \sqrt{11.64} = 3.41$
1.
A.
$P(-0.21 < Z < 1.57) = 0.0832 + 0.4418 = 0.525$.
B.
1.43
2.
A. $z_0 = 1.16$.
B. $z_0 = 2.101$.
B. $P(X > 86) = P\left(z > \frac{86 - 65}{10}\right) = P(z > 2.1) = 0.5 - 0.4821 = 0.0179 \approx 1.8\%$
4.
A. $P(X < 6200) = P(z < -0.25) = 0.4$
B. $P(\bar{X} < 6200) = P\left(z < \frac{6200 - 6300}{400/\sqrt{40}}\right) = P(z < -1.58) = 0.0571$
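5. $\mu = 51$, $\sigma = 14$;
A. $P(58 < X < 65) = P(0.5 < z < 1)$
$= 0.3413 - 0.1915$
$= 0.1498$
$0.1498 \times 200 \approx 30$ students.
B.
$z = 1.64$ or $1.65$ (the area in the right tail is 0.05)
$1.64 = \frac{x - 51}{14} \Rightarrow x \approx 74$
Therefore 74 is the minimum mark to obtain an A+.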
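The same two results can be obtained without the z-table by using `statistics.NormalDist` from the Python standard library. This is a sketch that assumes, as in the answer above, 200 students and a top 5% cut-off for an A+.

```python
from statistics import NormalDist

marks = NormalDist(mu=51, sigma=14)

# A. proportion, and expected number out of 200 students, scoring between 58 and 65
p_between = marks.cdf(65) - marks.cdf(58)
students = 200 * p_between

# B. mark that cuts off the top 5% of the distribution
cutoff = marks.inv_cdf(0.95)

print(round(p_between, 4), round(students), round(cutoff))
```

Small differences from the table-based figures come from rounding $z$ to two decimal places in the table.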
1. Confidence level is the probability that the interval estimate will contain the parameter and confidence
interval is a specific interval estimate of a parameter determined from the data obtained from a
sample and using specific confidence level.
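2. Given $n = 8$, $\bar{X} = 13.1$ and $s = 4.1$. Since $\sigma$ is unknown and $n < 30$, we use $t_{\alpha/2}$ in the formula.
Using d.f. $= 7$ and $\alpha = 0.05$, we get $t_{\alpha/2} = 2.365$. Hence the 95% confidence interval of $\mu$ is
$13.1 - 2.365\left(\frac{4.1}{\sqrt{8}}\right) < \mu < 13.1 + 2.365\left(\frac{4.1}{\sqrt{8}}\right)$
$9.7 < \mu < 16.5$.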
3. Given $\sigma = 900$ and $E = 5$. For 99% confidence level, we have $z_{\alpha/2} = 2.58$. Hence the minimum
sample size is
2.58 900
2
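5. Given that $\bar{X} = 26.1$, $\sigma = 4.2$ and $n = 30$. For 99% confidence level, we have $z_{\alpha/2} = 2.58$. Hence,
$26.1 - 2.58\left(\frac{4.2}{\sqrt{30}}\right) < \mu < 26.1 + 2.58\left(\frac{4.2}{\sqrt{30}}\right)$
$26.1 - 1.98 < \mu < 26.1 + 1.98$
$24.12 < \mu < 28.08$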
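6. Given that $\hat{p} = 0.29$, $\hat{q} = 0.71$ and $E = 0.05$. For 90% confidence level, we have $z_{\alpha/2} = 1.65$.
Hence,
$n = \hat{p}\hat{q}\left(\frac{z_{\alpha/2}}{E}\right)^2 = (0.29)(0.71)\left(\frac{1.65}{0.05}\right)^2 = 224.23 \approx 225$.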
$p = \hat{p} \pm z_{\alpha/2}\sqrt{\frac{\hat{p}\hat{q}}{n}} = 0.314 \pm 2.58\sqrt{\frac{0.314(0.686)}{500}} = 0.314 \pm 0.054$
$0.260 < p < 0.368$
B. Given that $E = 0.02$. We have
$n = \hat{p}\hat{q}\left(\frac{z_{\alpha/2}}{E}\right)^2 = (0.314)(0.686)\left(\frac{2.58}{0.02}\right)^2 = 3584.5 \approx 3585$
8. Given that $n = 995$, $\hat{p} = 0.291$ and $\hat{q} = 0.709$. For 90% confidence level, we have $z_{\alpha/2} = 1.65$.
$0.291 - 1.65\sqrt{\frac{0.291(0.709)}{995}} < p < 0.291 + 1.65\sqrt{\frac{0.291(0.709)}{995}}$
$0.291 - 0.0238 < p < 0.291 + 0.0238$
$0.2672 < p < 0.3148$
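The proportion formulas used in answers 6 to 8 can be sketched in Python as below. The function names `proportion_interval` and `sample_size` are hypothetical, and the $z_{\alpha/2}$ values are taken from the answers rather than computed.

```python
import math

def proportion_interval(p_hat, n, z):
    """Confidence interval p_hat +/- z * sqrt(p_hat * q_hat / n)."""
    margin = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin

def sample_size(p_hat, margin, z):
    """Minimum n = p_hat * q_hat * (z / E)^2, rounded up to a whole number."""
    return math.ceil(p_hat * (1 - p_hat) * (z / margin) ** 2)

low, high = proportion_interval(0.291, 995, 1.65)   # answer 8
n_6 = sample_size(0.29, 0.05, 1.65)                 # answer 6
n_7b = sample_size(0.314, 0.02, 2.58)               # answer 7B

print(round(low, 4), round(high, 4), n_6, n_7b)
```

Rounding up with `math.ceil` matches the convention of always taking the next whole number for a minimum sample size.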
1. The null hypothesis is a statistical hypothesis that states there is no difference between a parameter
and a specific value or there is no difference between two parameters. The alternative hypothesis
specifies a specific difference between a parameter and a specific value, or that there is a difference
between two parameters. For example, $H_0: \mu = 5$ and $H_1: \mu \neq 5$.
2.
A. $H_0: \mu = 9.5$ hrs
   $H_1: \mu \neq 9.5$ hrs (two-tailed test)
B. $H_0: \mu = \$105$
   $H_1: \mu < \$105$ (left-tailed test)
C. $H_0: \mu = \$39000$
   $H_1: \mu > \$39000$ (right-tailed test)
D. $H_0: \mu = 10$ mins
   $H_1: \mu < 10$ mins (left-tailed test)
[Figure: standard normal curve with critical regions beyond $z = \pm 1.96$ and the acceptance region between them]
Step 3: Compute the test statistic value. We find that $\bar{X} = 29.45$, $\mu = 29$, $\sigma = 2.61$ and $n = 30$.
Therefore, $z = \frac{29.45 - 29}{2.61/\sqrt{30}} = 0.944$
Since the test value $z = 0.944$ falls in the acceptance region, the decision is: “Do not reject $H_0$”.
Step 5: Summarize the results.
It cannot be concluded that the average height differs from 29 inches.
4.
A.
Step 1: State the hypothesis.
$H_0: \mu = 200$ (claim)
$H_1: \mu \neq 200$
[Figure: standard normal curve with critical regions beyond $z = \pm 1.96$ and the acceptance region between them]
Step 3: Compute the test statistic value. Given that $\bar{X} = 198.2$, $\mu = 200$, $s = 3.3$ and $n = 40$.
Therefore, $z = \frac{198.2 - 200}{3.3/\sqrt{40}} = -3.45$
Step 4: Make a decision
Since the test value $z = -3.45$ falls in the rejection region, the decision is: “Reject $H_0$”.
Step 5: Summarize the results.
There is enough evidence to reject the claim that adult dogs fed a special diet will have a weight of 200
lbs.
Since $\sigma$ is unknown but $n \geq 30$, we use $z_{\alpha/2}$ in the formula. For 95% confidence level,
we have $z_{\alpha/2} = 1.96$. Given that $\bar{X} = 198.2$, $s = 3.3$ and $n = 40$, the confidence interval of
$\mu$ is $198.2 - 1.96\left(\frac{3.3}{\sqrt{40}}\right) < \mu < 198.2 + 1.96\left(\frac{3.3}{\sqrt{40}}\right)$,
that is, $197.18 < \mu < 199.22$.
Step 2: Find the test value. We find that X 985, 980, 15 and n 150.
X 985 980
Therefore, z 2.357
n 15 150
−2.357 2.357
[Figure: standard normal curve with the critical region to the right of $z = 1.65$]
Step 3: Compute the test statistic value. Given that $\bar{X} = 3120$, $\mu = 3000$, $\sigma = 578$ and $n = 60$.
Therefore, $z = \frac{3120 - 3000}{578/\sqrt{60}} = 1.61$.
Since the test value $z = 1.61$ falls in the acceptance region, the decision is: “Do not reject $H_0$”.
Step 2: Find the test value. We find that $\hat{p} = 63/200 = 0.315$, $p = 0.25$, $q = 0.75$ and $n = 200$.
Therefore,
$z = \frac{0.315 - 0.25}{\sqrt{(0.25)(0.75)/200}} = 2.12$.
The area to the right of $z = 2.12$ is 0.0170. Since it is a right-tailed test, the P-value is 0.0170.
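The test value and P-value for this proportion test can be sketched with the standard library as follows. The variable names are illustrative; `NormalDist().cdf` plays the role of the z-table.

```python
import math
from statistics import NormalDist

p_hat, p0, n = 63 / 200, 0.25, 200

# z test value for a proportion: (p_hat - p) / sqrt(p * q / n)
z = (p_hat - p0) / math.sqrt(p0 * (1 - p0) / n)

# Right-tailed test: the P-value is the area to the right of z.
p_value = 1 - NormalDist().cdf(z)

print(round(z, 2), round(p_value, 4))
```

Because the code keeps the unrounded $z$, its P-value is marginally smaller than the table value obtained with $z = 2.12$.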
Step 4: Make a decision
[Figure: standard normal curve with critical regions beyond $z = \pm 2.58$ and the acceptance region between them]
Since the test value $z = 1.76$ falls in the acceptance region, the decision is: “Do not reject $H_0$”.
Step 5: Summarize the results.
Therefore, it does not suggest a difference from the national proportion.
3.
A. Use Confidence Interval Method.
$H_0: \mu = 3.18$ (claim)
$H_1: \mu \neq 3.18$
Since $\sigma$ is unknown and $n < 30$, we use $t_{\alpha/2}$ in the formula. Since $\alpha = 0.05$ and the
test is two-tailed, the area in each tail is 0.05/2 = 0.025. Using the t-distribution
table with d.f. $= 23$ and $\alpha = 0.05$ (or $2p = 0.05$), we find that $t_{\alpha/2} = 2.069$.
We find that $\bar{X} = 3.833$, $s = 1.435$ and $n = 24$. The confidence interval of $\mu$ is:
$3.833 - 2.069\left(\frac{1.434563}{\sqrt{24}}\right) < \mu < 3.833 + 2.069\left(\frac{1.434563}{\sqrt{24}}\right)$
$3.23 < \mu < 4.44$
Since the test value $t = 2.23$ falls in the rejection region, the decision is: “Reject $H_0$”.
Step 5: Summarize the results.
We conclude that the average family size differs from the national average.
Step 2: Find the confidence interval. We know that that X 685, s 125, n 400. And for 98%
confidence level, we have z /2 2.33 . Thus, the 98% confidence interval for is
125 125
685 2.33 685 2.33 638.4 731.6
400 400
$H_0: p = 0.35$ (claim)
$H_1: p < 0.35$
Step 2: Find the critical value.
Since $\alpha = 0.025$ and the test is left-tailed, the area in the left tail is 0.025. The corresponding
z-value is 1.96, so the critical value is $z = -1.96$. See the diagram below.
[Figure: standard normal curve with the critical region to the left of $z = -1.96$]
Therefore, $z = \frac{0.28 - 0.35}{\sqrt{\frac{(0.35)(0.65)}{400}}} = -2.935$
Since the test value $z = -2.935$ falls in the rejection region, the decision is: “Reject $H_0$”.
Step 5: Summarize the results.
It can be concluded that the company should not market this yogurt.
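$H_0: \mu_1 = \mu_2$
$H_1: \mu_1 \neq \mu_2$ (claim)
[Figure: standard normal curve with critical regions beyond $z = \pm 2.58$]
Since the test value $z = 3.04$ falls in the rejection region, the decision is: “Reject $H_0$”.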
Step 5: Summarize the results.
Therefore, it can be concluded that there is difference in mean earnings between male and female
college graduates.
We know that 1 and 2 are unkown and unequal, we use t-test. Since 005 and the test is
two–tailed, so the area on the left tail and right tail are 0.05/2 = 0.025. Using the t-distribution table
with d . f 24 and 005 (or 2 p 0.05) , the critical values are t 2.797 . See the
diagram below.
[Figure: t-distribution with critical regions beyond $t = \pm 2.797$ and the acceptance region between them]
Step 3: Compute the test statistic value. We know that $\bar{x}_1 = 223$, $n_1 = 30$, $s_1 = 6.1$ and
$\bar{x}_2 = 229$, $n_2 = 25$, $s_2 = 5.8$. Therefore,
$t = \frac{(223 - 229) - 0}{\sqrt{\frac{6.1^2}{30} + \frac{5.8^2}{25}}} = -3.731$
Step 4: Make a decision
Since the test value $t = -3.731$ falls in the rejection region, the decision is: “Reject $H_0$”.
Step 5: Summarize the results
There is enough evidence to support the claim that there is significant difference in cholesterol
levels between the two groups.
Step 2: Find the test value. We know that $\bar{x}_1 = 68.2$, $n_1 = 20$, $\sigma_1 = 2.5$ and
$\bar{x}_2 = 67.5$, $n_2 = 20$, $\sigma_2 = 2.8$. Therefore,
$z = \frac{(68.2 - 67.5) - 0}{\sqrt{\frac{2.5^2}{20} + \frac{2.8^2}{20}}} = 0.834$.
Step 3: Compute the P-value.
[Figure: standard normal curve with the area to the right of $z = 0.834$ shaded]
Since the P-value is greater than 0.05, the decision is “Do not reject $H_0$”.
Step 5: Summarize the results.
There is not enough evidence to support the claim that the athletes are taller than non-athletes.
Step 2: Find the confidence interval. We know that $\bar{x}_1 = 68.2$, $n_1 = 20$, $\sigma_1 = 2.5$ and
$\bar{x}_2 = 67.5$, $n_2 = 20$, $\sigma_2 = 2.8$. For 95% confidence level, we have $z_{\alpha/2} = 1.96$. Thus, the 95%
confidence interval for $\mu_1 - \mu_2$ is
$(68.2 - 67.5) - 1.96\sqrt{\frac{2.5^2}{20} + \frac{2.8^2}{20}} < \mu_1 - \mu_2 < (68.2 - 67.5) + 1.96\sqrt{\frac{2.5^2}{20} + \frac{2.8^2}{20}}$
$-0.945 < \mu_1 - \mu_2 < 2.345$
Since the confidence interval contains the hypothesized value $\mu_1 - \mu_2 = 0$, the decision is: “Do not
reject $H_0$”.
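Both the test value and the confidence interval for the difference of two means rest on the same standard error. A minimal Python sketch, using the values quoted above (the variable names are illustrative):

```python
import math

x1, n1, sigma1 = 68.2, 20, 2.5
x2, n2, sigma2 = 67.5, 20, 2.8

# Standard error of the difference between the two sample means.
std_error = math.sqrt(sigma1**2 / n1 + sigma2**2 / n2)

z = ((x1 - x2) - 0) / std_error          # test value under H0: mu1 - mu2 = 0
margin = 1.96 * std_error                # z_{alpha/2} = 1.96 for 95% confidence
low, high = (x1 - x2) - margin, (x1 - x2) + margin

print(round(z, 3), round(low, 3), round(high, 3))
```

Since the interval contains 0, it leads to the same decision as the test.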
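[Figure: standard normal curve with critical regions beyond $z = \pm 2.33$]
Step 3: Compute the test statistic value. We know that $\bar{x}_1 = 39420$, $n_1 = 35$, $s_1 = 1659$ and
$\bar{x}_2 = 30215$, $n_2 = 40$, $s_2 = 4116$.
Therefore, $z = \frac{(39420 - 30215) - 0}{\sqrt{\frac{1659^2}{35} + \frac{4116^2}{40}}} = 12.99$
Since the test value $z = 12.99$ falls in the rejection region, the decision is: “Reject $H_0$”.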
Step 5: Summarize the results
There is enough evidence to conclude that there is significant difference between the two states
chemists’ salaries.
Since $\sigma_1$ and $\sigma_2$ are unknown and unequal, we use the t-test. Since $\alpha = 0.05$ and the test is
right-tailed, the area in the right tail is 0.05. Using the t-distribution table with d.f. $= 23$ and
$\alpha = 0.05$ (or $p = 0.05$), the critical value is $t = 1.714$. See the diagram below.
[Figure: t-distribution with the critical region to the right of $t = 1.714$]
Therefore, $t = \frac{(48256 - 45633) - 0}{\sqrt{\frac{3912.4^2}{26} + \frac{5533^2}{24}}} = 1.92$.
Since the test value $t = 1.92$ falls in the rejection region, the decision is: “Reject $H_0$”.
1. Simple regression has one dependent and one independent variable whereas multiple regression
has one dependent variable and two or more independent variables.
2.
Since $\alpha = 0.05$ and the test is two-tailed, the area in each tail is 0.05/2 = 0.025.
Using the t-distribution table with d.f. $= 4$ and $\alpha = 0.05$ (or $2p = 0.05$), the critical values are
$t = \pm 2.776$. See the diagram below.
[Figure: t-distribution with critical regions beyond $t = \pm 2.776$]
Step 3: Compute the test statistic value. We know that $r = -0.833$ and $n = 6$. Therefore,
$t = r\sqrt{\frac{n - 2}{1 - r^2}} = -0.833\sqrt{\frac{6 - 2}{1 - (-0.833)^2}} = -3.01$
Since the test value $t = -3.01$ falls in the rejection region, the decision is: “Reject $H_0$”.
Step 5: Summarize the results.
There is a significant relationship between the number of eggs produced and price per dozen.
$a = \frac{(5.236)(7609557) - (5709)(4115.025)}{6(7609557) - (5709)^2} = 1.252$
and
D. The coefficient of determination, $r^2 = (-0.833)^2 = 0.694$. This means that 69.4% of the total
E. When $x = 1600$ million eggs, the price per dozen is $y' = 1.252 - 0.000398(1600) = 0.615$
per dozen
3. When a person is 32 years old, $x_1 = 32$, and has a GPA of 3.4, $x_2 = 3.4$, the income is
$y' = -34127 + 132(32) + 20805(3.4) = 40834$.
4.
Step 1: State the hypothesis.
$H_0: \rho = 0$ (There is no significant relationship between the variables)
$H_1: \rho \neq 0$ (There is a significant relationship between the variables)
Since $\alpha = 0.05$ and the test is two-tailed, the area in each tail is 0.05/2 = 0.025.
Using the t-distribution table with d.f. $= 4$ and $\alpha = 0.05$ (or $2p = 0.05$), the critical values are
$t = \pm 2.776$. See the diagram below.
[Figure: t-distribution with critical regions beyond $t = \pm 2.776$]
Step 3: Compute the test statistic value. We know that $r = 0.982$ and $n = 6$. Therefore,
$t = r\sqrt{\frac{n - 2}{1 - r^2}} = 0.982\sqrt{\frac{6 - 2}{1 - 0.982^2}} = 10.4$
Since the test value $t = 10.4$ falls in the rejection region, the decision is: “Reject $H_0$”.
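The test value for the significance of a correlation coefficient depends only on $r$ and $n$, so it is easy to check numerically. A short sketch follows; `correlation_t` is a hypothetical helper name, and the second call assumes the negative sign of $r$ in the egg-price exercise, which is implied by the negative slope of its regression line.

```python
import math

def correlation_t(r, n):
    """t test value for H0: rho = 0, with n - 2 degrees of freedom."""
    return r * math.sqrt((n - 2) / (1 - r**2))

t_strong = correlation_t(0.982, 6)    # this answer
t_eggs = correlation_t(-0.833, 6)     # egg-price exercise, r assumed negative

print(round(t_strong, 1), round(t_eggs, 2))
```

Both values lie beyond the critical values for 4 degrees of freedom, so the null hypothesis is rejected in each case.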
1.
Step 1: State the hypothesis.
$H_0$: The number of transactions made on each of the 5 days is the same.
$H_1$: The number of transactions made on each of the 5 days is not the same.
If $H_0$ is true, the expected number of transactions made on each of the 5 days is the same.
$E$ = the expected number of transactions made per day
$= \frac{\text{Total number of transactions made}}{\text{No. of days}} = \frac{1200}{5} = 240$
The test value is $\chi^2 = \sum \frac{(O - E)^2}{E} = 23.183$
Step 4: Make a decision and summarize the results.
Since the test value $\chi^2 = 23.183$ falls in the rejection region, the decision is: “Reject $H_0$”.
Therefore, we conclude that the number of transactions made using this ATM on each of the 5 days
is not the same.
2.
Step 1: State the hypothesis.
The test value is $\chi^2 = \sum \frac{(O - E)^2}{E} = 8.24$
Since the test value $\chi^2 = 8.24$ falls in the acceptance region, the decision is: “Do not reject $H_0$”.
Therefore, we conclude that the two attributes, gender and opinions of adults, are independent.
1.
Step 1: State the hypothesis and identify the claim.
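$H_0: \mu_1 = \mu_2 = \mu_3$
$H_1$: At least one mean is different from the others. (claim)
The critical value is 3.5219, obtained from the F-distribution table with $\alpha = 0.05$.
[Figure: F-distribution with the rejection region to the right of 3.5219]
$n_1 = 7$, $n_2 = 7$, $n_3 = 8$
$\bar{X}_{GM} = \frac{\sum X}{N} = \frac{4780}{22} = 217.273$.
$s_B^2 = \frac{\sum n_i (\bar{X}_i - \bar{X}_{GM})^2}{k - 1}$
Within-group variance:
$s_W^2 = \frac{\sum (n_i - 1) s_i^2}{N - k} = \frac{6(5695.238) + 6(3928.571) + 7(7335.714)}{6 + 6 + 7} = 5741.729$
Therefore,
$F = \frac{s_B^2}{s_W^2} = \frac{13771.799}{5741.729} = 2.3985$.
Step 4: Since the test value $F = 2.3985$ lies in the acceptance region, the decision is: “Do not
reject $H_0$”.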
Step 5: There is not enough evidence to support the claim that a difference in mean sodium
amounts exists among condiments, cereals, and desserts.
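The within-group variance and the F test value can be recomputed from the group sizes and sample variances quoted in the answer. This is a sketch only: the between-group variance is taken as given (13771.799) because the group means are not listed here.

```python
# Sample sizes and sample variances of the three groups, as used in the answer.
sizes = [7, 7, 8]
variances = [5695.238, 3928.571, 7335.714]
s_b2 = 13771.799          # between-group variance quoted in the answer

k = len(sizes)            # number of groups
N = sum(sizes)            # total number of observations

# Pooled within-group variance: sum of (n_i - 1) * s_i^2 over N - k.
s_w2 = sum((n - 1) * v for n, v in zip(sizes, variances)) / (N - k)
F = s_b2 / s_w2

print(round(s_w2, 3), round(F, 3))
```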
2. The two- way ANOVA allows the researcher to test the effects of two independent variables and a
possible interaction effect. The one-way ANOVA can test the effects of only one independent
variable.
3.
Step 1: State the hypothesis
Subcontractor: d.f.N $= a - 1 = 2 - 1 = 1$.
Home type: d.f.N $= b - 1 = 3 - 1 = 2$.
Interaction: d.f.N $= (a - 1)(b - 1) = (2 - 1)(3 - 1) = 2$.
Error: d.f.D $= ab(n - 1) = 2(3)(5 - 1) = 24$.
Step 3: Complete the ANOVA table and compute the test values.
Source            SS          d.f.   MS          F
Subcontractor     1672.553    1      1672.553    122.084
Home type         444.867     2      222.4335    16.236
Interaction       313.267     2      156.6335    11.433
Within (error)    328.800     24     13.7
Total             2759.487    29
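Completing the table is mechanical once the sums of squares and degrees of freedom are known: each mean square is SS divided by d.f., and each F value is that mean square divided by the error mean square. A minimal Python sketch (the dictionary layout is illustrative):

```python
# Sums of squares and degrees of freedom from the exercise.
table = {
    "Subcontractor": (1672.553, 1),
    "Home type": (444.867, 2),
    "Interaction": (313.267, 2),
}
ss_within, df_within = 328.800, 24

ms_within = ss_within / df_within        # error mean square
f_values = {}
for source, (ss, df) in table.items():
    ms = ss / df                         # mean square for this source
    f_values[source] = ms / ms_within    # F test value
    print(f"{source:<14} MS = {ms:10.4f}  F = {f_values[source]:8.3f}")
```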
Subcontractor:
Since the test value $F = 122.084$ falls in the rejection region, we reject the null hypothesis and
conclude that there is a difference between the mean number of days taken by the two subcontractors to build.
Home type:
Since the test value $F = 16.236$ falls in the rejection region, we reject the null hypothesis and
conclude that there is a difference between the mean number of days taken to build the three types of home.
Interaction:
Since the test value $F = 11.433$ falls in the rejection region, we reject the null hypothesis and
conclude that there is an interaction effect between the home type and the subcontractor on the days to build.