BUSINESS STATISTICS
Module Book
Table of Contents
INTRODUCTION
Session 5: Probability
APPENDICES
- Appendix 1: List of Formulas
- Appendix 2: Standard Normal Table
- Appendix 3: t-distribution Table
- Appendix 4: Chi-square distribution Table
- Appendix 5: F-distribution Table
INTRODUCTION
Content
Module Aims
Learning Outcomes
(i) Organizing and portraying statistical data using tables and graphical techniques to
convey practical meanings.
(ii) Calculating probabilities of data sets.
(iii) Establishing associations between variables, so as to perform correlations and
estimations.
3. Show cognitive skills with respect to:
(i) Obtaining essential knowledge and techniques on methods for selecting samples from
population, as well as making statistical inferences about the population.
(ii) Applying statistical techniques to quantify information, analyse data, interpret results
and make sound decisions.
(iii) Utilising Microsoft Excel to analyse and solve statistical problems.
SESSION 1: Introduction to Statistics
TOPICS
• Definition of Statistics
• Descriptive versus Inferential Statistics
• Population vs. Sample
• Types of Variables
  - Qualitative vs. Quantitative Variables
  - Discrete vs. Continuous Variables
• Levels of Measurement
• Sampling Methods & Bias Involved
  - Reasons for Sampling
  - Sampling and Data Collection Methods
  - Bias in Statistics
LEARNING OUTCOMES
Students should be able to:
- Understand what is meant by Statistics
- Describe elements comprising the decision-making process
- Understand descriptive and inferential statistics
- Differentiate between population and sample
- Distinguish the types of variables being studied & their levels of measurement
- Identify the various methods of data collection
- Briefly describe various sampling methods
- Identify various ways that bias could occur
TEXT
Business Statistics Module Book, Session 1
SESSION 5: Probability
TOPICS
• The Language of Probability
• Probability Rules
• Addition Rule
• Multiplication Rule
• Conditional Probability
• Bayes Theorem
• Discrete Probability Distribution
LEARNING OUTCOMES
Students should be able to:
- Define basic terms used in probability, namely experiment, outcome/event and sample space.
- Understand mutually exclusive and independent events
- Draw Venn Diagrams for computing probabilities
- Understand and apply basic addition and multiplication rules and special addition and multiplication rules in computing probabilities
- Understand and compute conditional probabilities
- Apply Bayes’ theorem and draw Tree Diagrams
- Compute the mean, variance and standard deviation of a Discrete Probability Distribution.
TEXT
Business Statistics Module Book, Session 5

SESSION 6: Use of EXCEL for Data Analysis 2
LEARNING OUTCOMES
Students should be able to:
- Perform data analysis using Excel
TEXT
Business Statistics Module Book, Session 6

SESSION 7: Linear Regression and Correlation
TOPICS
• Relationship between Two Quantitative Variables: Correlation and Regression Analysis
• Analysing Associations with EXCEL
• Limitations of regression analysis
LEARNING OUTCOMES
Students should be able to:
- Construct and interpret scatter plots of bivariate quantitative variables
- Identify types of relationships between two quantitative variables
- Fit a regression equation using least squares method
- Interpret the slope and y-intercept in the regression equation
- Calculate and interpret the correlation coefficient
- Calculate and interpret the coefficient of determination
- Understand the limitations of linear regression
TEXT
Business Statistics Module Book, Session 7
SESSION 9: Estimation
TOPICS
• Types of Point Estimates
• Confidence Interval for a Population Mean
• Confidence Interval for a Population Proportion
• Factors Influencing Confidence Interval Width
• Sample Size Determination
LEARNING OUTCOMES
Students should be able to:
- Explain the difference between a point estimate and an interval estimate
- Use normal distribution to construct a confidence interval for population mean and proportion
- Use t distribution to construct a confidence interval for population mean
- Decide whether normal or t distribution should be used in constructing confidence interval for population mean
- Understand the factors influencing width of Confidence Interval
- Determine a sample size at specified levels of confidence and margin of error
TEXT
Business Statistics Module Book, Session 9

SESSION 11: Analysis of Categorical Data: Chi-Square Test of Independence
TOPICS
• Contingency Table Analysis
• Exploring relationship between two qualitative (categorical) variables
LEARNING OUTCOMES
Students should be able to:
- Organize categorical data into a contingency table
- Set up appropriate null and alternative hypotheses
- Compute expected frequencies, degrees of freedom from a given contingency table.
- Apply chi-square distribution to perform a test of association
- Understand precautions about use of chi-square
TEXT
Business Statistics Module Book, Session 11
Teaching and Learning Methods
Participants will learn through a combination of lectures and practical activities. Participants
will be expected to learn independently by carrying out reading and directed study beyond
that available within taught classes.
Indicative Readings
Recommended Text Tan Suat Pheng, Business Statistics Module Book, SIM Global
Education, 2020
Assessment/coursework
All assessments must comply with the SIM Rules and Regulations. To satisfy requirements,
students must:
1) Complete their assignments satisfactorily and submit them by the due dates. A penalty of
20% of the total marks will be imposed for late submission. A submission made more than
1 calendar day past the deadline will receive a zero mark.
2) Complete all assignments and the final examination in a satisfactory manner.
3) Reference all their work and observe SIM’s policy on plagiarism. Students found guilty of
plagiarism will be dealt with severely.
4) Adopt either the Harvard or APA (American Psychological Association) Referencing
Styles.
5) Spend at least 100 hours (including class attendance and assignments) on the module in
order to fare reasonably.
Calculators
Only non-programmable calculators (including non-programmable scientific) are permitted in
examinations. Listed below are some models that students can use:
Casio
FX82MS FX85MS FX95MS FX82ES PLUS
Sharp
EL509WS EL506W EL-570ES Plus
BUSINESS STATISTICS
SESSION 1
INTRODUCTION TO STATISTICS
1. Introduction
In common usage, many would refer to Statistics as numerical facts, for example average
starting salary of graduates or average number of cars sold in a month. Some others refer to
it as a way of collecting and displaying large amounts of numerical information. And to still
another group it is a way of “making decisions in the face of uncertainty.” Each of these points
of view is correct.
Every day, we make decisions that may be personal or business related. Often, the
situations or problems that we face in the real world have no precise or definite solutions. It is
from this perspective of informed and more effective decision making that we consider why we
need to know about statistics. Data are collected everywhere, and statistical knowledge is
required to turn them into useful information.
To study statistics, we need to be able to speak its language. We will define some basic terms
commonly used in statistics and in research.
Population
The entire set or collection of people or objects of interest. Example: the entire collection of
students in a university. The population that is being studied is also known as the target
population.
Sample
A subset or portion of the population. Example: 10% of the students in the university were
surveyed. A sample that is chosen to represent the characteristics of the population as closely
as possible is known as a representative sample.
Parameter
A numerical measure that is computed to describe a characteristic of an entire population.
Example: The average age of all students who were admitted to ABC university this year was
21 years. The value 21 is a parameter.
Statistic
A numerical measure that is computed to describe a characteristic from a sample. Example:
The average height of 25 randomly selected female students was 1.6 metres. The value 1.6 is
a sample statistic.
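The distinction between a parameter and a statistic can be illustrated with a short Python sketch. The population below is invented purely for illustration (1,000 hypothetical student ages generated with a fixed seed); it is not data from this module. The same summary measure, the mean, is a parameter when computed from the whole population and a statistic when computed from a sample.

```python
import random
from statistics import mean

# Hypothetical population: ages of 1,000 students (made-up values for illustration only).
rng = random.Random(42)
population_ages = [rng.randint(18, 30) for _ in range(1000)]

# Parameter: a measure computed from every member of the population.
population_mean = mean(population_ages)

# Statistic: the same measure computed from a sample (here, 25 students chosen at random).
sample_ages = rng.sample(population_ages, 25)
sample_mean = mean(sample_ages)

print(f"Parameter (population mean): {population_mean:.2f}")
print(f"Statistic (sample mean):     {sample_mean:.2f}")
```

The two values will usually be close but not identical, which is why a sample statistic is treated as an estimate of the population parameter.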
Census
A survey that includes every member in the population. Example: 100% of households in
Singapore are surveyed once every 10 years.
Example 1.1
Indicate which of the following refers to a population and which refers to a sample:
2. Types of Statistics
Broadly speaking, statistics can be divided into two areas: descriptive statistics and inferential
statistics.
2.1 Descriptive Statistics
When data are first collected, they are known as raw data. Raw data sets can be very large. This
makes it difficult to draw conclusions or make decisions with data in its original form. To have
a better understanding of the data, we can organise or tabulate the data, construct charts or
graphs and compute some summary measures. The portion of statistics that helps us do these
tasks is known as descriptive statistics.
2.2 Inferential Statistics
A major portion of statistics deals with drawing conclusions and making predictions and
generalisations about the population based on results obtained from samples. For example, we draw
conclusions about the satisfaction of all customers of a restaurant by surveying, say, 100
customers. A quality control manager may inspect randomly selected products from a batch of
production to make a decision about the quality of products from that production run.
Inferential statistics can be defined as those methods that make possible the
estimation of characteristics of a population based only on sample results.
Example 1.2
Which branch of statistics do they belong to?
3. Types of Variables
There are two basic types of variables, namely Qualitative and Quantitative (see Figure 1.2).
Figure 1.2 Types of Variables
Quantitative variables yield numeric responses. Examples of quantitative variables are weight,
height and number of siblings.
A variable that can be measured numerically is known as a quantitative
variable. Data collected from a quantitative variable are known as quantitative
data.
Quantitative variables can be subdivided into two classifications: Discrete and Continuous.
Although there could be exceptions, the only distinction that we will make here is that a discrete
variable arises from counting while a continuous variable arises from measuring.
A variable that can assume any value over a specified range is known as a
continuous variable.
You measure height, weight, amount spent on books and travelling time. Hence, these are
continuous variables.
There may be some variables which appear numeric but should be classified as qualitative
variables. Examples are mobile phone number and car registration number. These are merely
identification numbers. They do not measure anything and you will not be able to do
mathematical computations on these variables. Hence, they are not quantitative variables.
Example 1.3
Classify the following variables as Quantitative (state Discrete or Continuous) or Qualitative:
      Variable                                                Variable Type
(a)   Amount spent on clothing last month.                    Quantitative, continuous
(b)   Favourite shopping mall.                                Qualitative
(c)   Time taken to serve a bank customer.                    Quantitative, continuous
(d)   Number of subjects taken by a student this semester.    Quantitative, discrete
4. Scales of Measurement
4.1 Nominal level
Data that are qualitative or categorical have a nominal level of measurement. Any numbers
assigned are for identification purposes only and have no mathematical meaning. Examples are
the colour of a car and the preferred brand of a product.
The nominal level applies to data that are categorised and these categories
are used for identification purposes only.
We can neither rank the categories nor do any mathematical operations (such as addition,
subtraction, multiplication or division) on them. For example, for the variable Gender, we can
assign codes 1=Male and 2=Female. The codes have no mathematical meaning, as we cannot say
code 1 is superior to code 2.
4.2 Ordinal level
Data that have some order or can be ranked have an ordinal level of measurement.
The ordinal level applies to data that are categorised and these categories can
be ranked.
In a survey, people were asked to rate the service at a restaurant as excellent, good or poor.
These categories can be ranked. Excellent has the highest rank and poor has the lowest rank.
So, we have 1=Excellent, 2=Good and 3=Poor. Hence, we do know that code 1 ranks higher than
code 2 and code 3. An important characteristic of an ordinal scale is that we cannot
distinguish the magnitude of the difference between the ratings. We do not know if the
difference between “excellent” and “good” is the same as the difference between “good” and
“poor”.
4.3 Interval level
Data that are numeric and for which the difference between two values is meaningful are said
to have an interval level.
Data with an interval scale may contain a zero point, but zero does not mean absence of the
attribute. Examples of variables with an interval scale are temperature, intelligence quotient
(IQ) and shoe size. For temperature, for example, a zero value does not represent absence of
warmth. In fact, by our own measurement, it is cold! A zero IQ does not mean a person has no
intelligence. There is also no natural zero for size (shoe size, dress size etc.).
The interval level applies to data that can be ranked and the differences
between two values can be calculated and interpreted.
The difference between two values of an interval scale variable can be interpreted. For example,
in an IQ test the difference between someone who scores 120 and someone who scores 100 is
20. This difference is the same as the difference between a score of 90 and a score of 70.
However, ratios do not make sense for interval scale data. A person who scores 120 in an IQ
test is not twice as intelligent as a person who scores 60. Neither is a temperature of
40° Celsius twice as warm as 20° Celsius. The foot length of a person who wears shoe size 4 is
not half that of someone who wears shoe size 8.
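As a rough illustration of why ratios are not meaningful on an interval scale, the Python sketch below takes the two temperatures as 40 and 20 degrees Celsius and re-expresses them in kelvin, a temperature scale with a true zero. The conversion constant 273.15 is the standard Celsius-to-kelvin offset; the function name is only illustrative.

```python
def celsius_to_kelvin(celsius: float) -> float:
    """Convert a Celsius reading to kelvin, a scale with a true zero."""
    return celsius + 273.15

warm, cool = 40.0, 20.0

# Differences are meaningful on an interval scale and survive the change of scale.
diff_celsius = warm - cool
diff_kelvin = celsius_to_kelvin(warm) - celsius_to_kelvin(cool)

# Ratios are not: the Celsius ratio depends on where the Celsius scale places its zero.
ratio_celsius = warm / cool
ratio_kelvin = celsius_to_kelvin(warm) / celsius_to_kelvin(cool)

print("Difference:", diff_celsius, "vs", round(diff_kelvin, 2))
print("Ratio:     ", ratio_celsius, "vs", round(ratio_kelvin, 3))
```

The difference is the same on both scales, while the ratio changes from 2 to a value only slightly above 1, so "twice as warm" is not a meaningful statement for Celsius readings.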
4.4 Ratio level
The ratio level is the “highest” scale of measurement. Almost all quantitative variables are
recorded on the ratio scale.
The ratio level applies to data with known units of measurement and all
arithmetic operations (addition, subtraction, multiplication and division) can
be done with meaningful interpretation.
Examples of variables with ratio scale of measurement are income, sales and weight. A zero
point has a meaning in ratio scale data. If you have zero dollars, it means you have no money.
A salesperson with zero sales did not sell any product.
Example 1.4
      Variable                                                Scale of Measurement
(a)   Number of bedrooms in an apartment.                     Ratio
(b)   Favourite car colour.                                   Nominal
(c)   Time taken to complete an assignment.                   Ratio
(d)   Rating of hotel service (1=Excellent to 5=Poor)         Ordinal
(e)   Highest education level (1=Completed Primary            Ordinal
      education, 2=Completed Secondary education and
      3=Completed Tertiary education)
(f)   Today’s temperature in Melbourne.                       Interval
5. Sampling Methods
As mentioned in Section 1.2, a sample is a subset of a population that is selected for analysis.
Rather than taking a complete census of the whole population, statistical sampling procedures
focus on a small representative group of a larger population.
Sampling methods can be classified under Probability and Non-Probability sampling methods.
(see Figure 1.4)
Probability Sampling
In Probability Sampling methods, the researcher selects members from a population at random
according to a few set selection criteria. These selection criteria give every unit a known
chance (not necessarily equal) of being selected.
Non-Probability Sampling
Non-probability sampling methods rely on the researcher’s own choice of members rather than on
random selection. Hence, not every unit or person has a chance of being included in the sample.
Figure 1.4 Sampling Methods
Convenience Sampling
This method is dependent on the ease of accessibility to units that you wish to survey. For
example, surveying passers-by in a busy street on their opinion of a new policy to be
implemented by the government.
Judgement Sampling
A sample is selected at the discretion of the researcher, based on personal judgement about
which people possess the qualities that the researcher expects of the target population.
For example, for day-to-day business problems or public-policy making, judgement sampling
may be the only practical method, because action must be taken immediately on the basis of
estimates that are readily available to businesspeople and public officials.
Simple Random Sampling
Assume you have a population of 50 persons and would like to select 5 persons at random.
Table 1.1 Random Number Table
(2) Since the population size is a two-digit number, we will use the first two digits of the
numbers listed in the table.
(3) Start at any value in the table. Assume we land on 08 (see Table 1.1).
(4) The second number will be 47, the third is 02 and so on. If a number is not within the
range of 01 to 50, discard it. Continue until you have found 5 numbers whose first two
digits fall within this range.
(5) From this table, we arrive at 08, 47, 02, 11, and 38.
(6) Result: Persons 08, 47, 02, 11, and 38 will be used for our random sample.
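The table-reading procedure above can be sketched in Python. Table 1.1 is not reproduced in this text, so the five-digit entries below are hypothetical values invented so that the walk ends with the same five persons; the function name is also illustrative. Skipping a number that has already been chosen is an added assumption (sampling without replacement).

```python
def select_from_table(table, population_size, sample_size):
    """Pick unit numbers by reading the first two digits of each table entry in order."""
    chosen = []
    for entry in table:
        number = int(entry[:2])              # use only the first two digits
        if not 1 <= number <= population_size:
            continue                         # outside 01..population_size: discard
        if number in chosen:
            continue                         # assumption: skip repeats
        chosen.append(number)
        if len(chosen) == sample_size:
            break
    return chosen

# Hypothetical table entries, listed in reading order from the chosen starting point.
random_number_table = ["08422", "47390", "62815", "02766",
                       "91204", "11058", "38641", "25177"]

sample = select_from_table(random_number_table, population_size=50, sample_size=5)
print([f"{n:02d}" for n in sample])
```

In practice, a call such as `random.sample(range(1, 51), 5)` from the Python standard library draws a simple random sample of 5 persons from 50 directly, without a printed table.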
Systematic Sampling
In systematic sampling, units of a sample are chosen at pre-defined fixed intervals. A starting
point must be selected for the sample. Some examples are:
- A professor selects every 10th person from the list of names in the student register to
attend a seminar.
- An auditor selects every 5th purchase order in a file for checking.
- A researcher decides to interview every 20th person leaving a home exhibition.
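A minimal Python sketch of systematic sampling follows. The list of 50 purchase orders and the function name are assumptions made for illustration. If no starting point is given, one is chosen at random within the first interval.

```python
import random

def systematic_sample(units, interval, start=None):
    """Return every `interval`-th unit, beginning at 0-based index `start`."""
    if start is None:
        start = random.randrange(interval)   # random starting point within the first interval
    return units[start::interval]

# Hypothetical file of 50 purchase orders, numbered 1 to 50.
purchase_orders = list(range(1, 51))

# Every 5th order, starting from the 5th order (index 4), as the auditor might do.
selected = systematic_sample(purchase_orders, interval=5, start=4)
print(selected)
```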
Stratified Sampling
This is a sampling method where the population is divided into small groups (strata) that do not
overlap but represent the entire population. Units are then selected from each group
proportionately to form the sample. For example, a credit card company may create strata based
on income level – “less than $40,000”, “$40,000 to $60,000” etc. to study the type and level of
spending of credit card holders. Marketers can analyse which income groups to target to
formulate appropriate marketing strategies.
Assume a car distributor had 1000 customers last year spread out among various age groups. It
wishes to select a stratified sample of 100 customers for a survey. First, the distributor will
find the percentage representation of each stratum in the population. Using these percentages,
a proportionate number will be sampled from each stratum.
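The proportionate allocation described above can be sketched in Python. The age groups and their customer counts are hypothetical, since the text does not give the breakdown; only the population of 1000 and the sample of 100 come from the example.

```python
def proportional_allocation(stratum_sizes, sample_size):
    """Allocate a sample across strata in proportion to each stratum's size."""
    total = sum(stratum_sizes.values())
    return {name: round(sample_size * size / total)
            for name, size in stratum_sizes.items()}

# Hypothetical age-group breakdown of the 1000 customers (not from the module text).
customers_by_age = {"Under 30": 200, "30 to 39": 350, "40 to 49": 300, "50 and above": 150}

allocation = proportional_allocation(customers_by_age, sample_size=100)
for group, n in allocation.items():
    share = customers_by_age[group] / sum(customers_by_age.values())
    print(f"{group}: {share:.0%} of population, {n} customers sampled")
```

With less convenient percentages, rounding may leave the allocated total one or two units away from the intended sample size, and a small manual adjustment is then needed.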
Cluster Sampling
Cluster sampling is a method where the researcher decides on some criteria to divide the entire
population into groups or clusters, each of which represents the population. One or two of these
clusters are then selected, and some or all members in the selected clusters are surveyed.
Assume the housing board wishes to do a study of the expenditure patterns of HDB households
in Singapore. The population is already divided based on HDB estates (assume there are 20
such estates or clusters).
H1 H5 H9 H13 H17
H2 H6 H10 H14 H18
H3 H7 H11 H15 H19
H4 H8 H12 H16 H20
The clusters are similar to one another and each cluster represents the population well. The
housing board then selects any 2 clusters, say H1 and H5. Households in these 2 clusters are
then interviewed. Households in all other clusters are excluded from the survey.
If the number of households in the two clusters is too large, simple random sampling can then
be carried out within each cluster. This is known as multi-stage sampling.
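The two stages can be sketched in Python as follows. The cluster labels H1 to H20 follow the illustration above; the household lists (500 per estate), the within-cluster sample size of 50 and the seed are assumptions made only so that the sketch runs.

```python
import random

rng = random.Random(2020)   # fixed seed so the sketch is repeatable

# The 20 estates (clusters) H1..H20.
clusters = [f"H{i}" for i in range(1, 21)]

# Hypothetical household lists: 500 households per estate (invented figure).
households = {c: [f"{c}-{n:03d}" for n in range(1, 501)] for c in clusters}

# Stage 1: select 2 clusters at random; all other clusters are excluded.
selected_clusters = rng.sample(clusters, 2)

# Stage 2 (multi-stage sampling): simple random sample of 50 households per selected cluster.
survey_sample = {c: rng.sample(households[c], 50) for c in selected_clusters}

for cluster, chosen in survey_sample.items():
    print(cluster, len(chosen), chosen[:3])
```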
Example 1.5
Identify the sampling method used in the following situations:
(a) Description: The database of a large hospital contains records of 10,000 patients. The
    records are sequentially numbered from 1 to 10,000. A sample of 100 patients was obtained
    by choosing patients numbered 100, 200, 300, ……, 10,000.
    Sampling method: Systematic sampling
(b) Description: A wholesale food distributor would like to test the demand for a new food
    product. He distributes food through five large supermarket chains. The food distributor
    selects a sample of stores from each chain and tests his new product in these stores.
    Sampling method: Stratified sampling
(c) Description: Interviewers station themselves near office buildings, MRT stations and
    bus-stops to interview people who pass by about a pending increase in transport fares.
    Sampling method: Convenience sampling
(d) Description: A private university has 5 groups (A, B, C, D, E) of 10 students each pursuing
    a Masters in Psychology programme. Only Group B was selected and the students interviewed
    about their satisfaction with the programme.
    Sampling method: Cluster sampling
(e) Description: To determine what class to put students into at a school, names are entered
    into a software program, which then randomly assigns students in each class.
    Sampling method: Simple random sampling
6. Survey Methods
Face-to-face interviews
Advantages: Can gather in-depth attitudes, allow for probing and getting detailed responses.
Disadvantages: Relatively expensive and time consuming, may require a quiet area to conduct.
Mail questionnaire
Advantages: Allow time for people to answer questions, minimal staff requirements, able to
cover large geographical area
Disadvantages: Low response rate, questions may be misunderstood, require pre-test to
minimize bias.
Focus groups
Advantages: Larger group of participants at one time, group dynamics generate ideas.
Disadvantages: Difficulty of scheduling, require strong facilitator, may need special
equipment to record.
7. Bias
Bias is said to occur when the sample results are systematically different from the truth about
the population.
Selection bias occurs when there is a tendency to include or exclude certain persons or units
in the sample. In other words, the sample selected does not accurately reflect the target
population.
Examples:
- Using an online survey to research the importance of smart technology in our lives may
exclude elderly people.
- A call-in radio show that solicits audience participation on controversial topics such as
gun control or setting up casinos tends to over-represent individuals who have strong
opinions.
Response bias occurs when the responses given are inaccurate, for reasons such as ambiguous
question wording, sensitivity of information, leading questions or lack of interviewer
training. Here are some examples of poor question wording:
“Do you shop regularly?” (the word “regularly” is ambiguous)
“Has any family member been treated for behaviour disorder?” (sensitivity of information)
“Should online purchases be delivered on time as part of customer service?” (leading question)
Non-response bias arises when certain types of persons or units tend not to respond to the
study. Non-responders may share similar characteristics. In a mail survey, the upper and lower
social classes tend not to respond, so the viewpoints of the middle class are over-represented.
8. Discussion questions
1. Explain whether the following variables are quantitative (state discrete or continuous)
or qualitative.
(a) The built-in area of a HDB 5-room flat.
(b) The colour of Kevin’s new sports car.
(c) The number of applications received by a university.
(d) The “hotline” telephone number of ABC Bank.
3. Indicate which of the following examples refer to a population and which refer to a
sample.
(a) An auditor selected 30 employee leave records from staff working at ABC Bank
for checking.
(b) Results of all students who sat for the examination were evaluated.
5. For each of the following statements, indicate whether the highlighted value is a
parameter or a statistic:
(a) The average annual advertising expenditure was $20,000 obtained from a survey
of 50 retail stores at The Jewel, Changi Airport.
(b) The total amount of investment in financial products at all the branches of City
Bank was $288 million in 20X9.
6. Explain the MAIN type of bias that is evident in the situations below:
(a) A supermarket researching the expenditure of its customers sent a questionnaire to
all its loyalty cardholders.
(b) A hotel requested its guests to drop by its admin office to do a survey about their
stay in the hotel.
(c) A researcher selected a group of senior bankers to study the cost of living in
Singapore.
(d) John, a grassroots member was asked to knock on every door of a HDB block
to get an idea about building a children’s playground in the vicinity.
(e) A student wrote the following question in a questionnaire. “Don’t you think the
very wealthy people should donate more to charity?”
7. Identify the type of sampling method used:
(a) At a birthday party, 50 children were each assigned a number from 01 to 50.
Five numbers were chosen at random to receive a prize.
(b) An auditor selected every 10th purchase order from a file for checking.
(c) In a study about credit card usage, customers were grouped into five groups
based on their spending level for the last six months. 10 customers were then
selected from each of these five groups.
9. Supplementary questions
2. For each of the variables listed, indicate whether it is a quantitative (state discrete or
continuous) or qualitative variable.
      Variable                                                Scale of Measurement
(a)   Number of pet dogs owned by Ah Tim
(b)   Consultation time with Dr Huan
(c)   Chelsea’s dress size
(d)   Rating (scale of 1 to 10) of a car model by a car
      magazine.
(e)   Sugar content (in grams) of a soft drink
(f)   Most popular music artiste this year
4. For the statements stated below, identify which refers to a population and which refers
to a sample.
(a) 20 bottles of wine in a production process were selected for a taste test.
(b) A nurse took the temperature and blood pressure of ALL patients at St Luke
Eldercare Centre.
5. Provide examples of situations where the following sampling methods were used:
(a) Simple random sampling
(b) Systematic sampling
(c) Stratified sampling
(d) Cluster sampling
6. A garment factory has 200 workers. The average time taken to complete a particular
procedure by these 200 workers was 10.9 minutes. A sample of 20 workers was then
taken. The average time taken by these 20 workers was found to be 10.4 minutes.
(a) Which values represent parameters?
(b) Which values represent statistics?
7. A marketer wants to obtain feedback for the design of a new product packaging. Five
designs are being considered and respondents were asked to rank their preferences for
these designs (5= Most Preferred and 1= Least Preferred). The sample of respondents
was obtained by interviewing every 20th shopper who walks into a particular store.
BUSINESS STATISTICS
SESSION 2
1. Introduction
When data are recorded in the sequence that they are collected, they are known as raw data.
Such data are random and unranked.
Table 2.1 and Table 2.2 show examples of qualitative raw data and quantitative raw data
respectively.
64 42 83 24 12 15
67 51 77 57 81 19
62 46 35 27 69 41
64 25 48 64 72 48
50 34 75 38 51 26
Table 2.2 Transactions ($) at a cafe
When data sets are small, it is relatively easy to observe differences among the raw values
or ungrouped data. However, with moderate to large data sets, the pattern of variability
becomes less apparent. Hence, it is better to tabulate the data into more readable formats.
A frequency table shows how the frequencies are distributed over various categories. For
the data in Table 2.1, the variable is the type of major. The number of students belonging to
each major is called the frequency of that category.
Example 2.1
Set up a frequency table for the data in Table 2.1.
Solution:
The completed table is shown below:
Type of Major Number of Students
Business 7
Economics 5
Finance 6
Humanities 2
Others 5
Total 25
Table 2.3 Frequency Table : Type of Major
A bar chart (also called a bar graph) can be used to display qualitative data. We mark the various
categories on the horizontal axis and frequency counts on the vertical axis. Usually we leave a
gap between the categories.
Example 2.2
From the frequency table obtained in Example 2.1, construct a bar graph.
Solution:
[Bar chart: number of students (vertical axis, 0 to 8) for each type of major (horizontal axis: Business, Economics, Finance, Humanities, Others)]
Sometimes, in a bar chart, the categories are marked on the vertical axis and the frequencies on
the horizontal axis. This is known as a horizontal bar chart.
A pie chart is more commonly displayed in percentages, although it can be used to display
frequencies. The pie is divided into different portions that represent the percentages for the
different categories. These percentages are known as relative frequencies.
Example 2.3
Create a pie chart based on the information from Table 2.3
Type of Major Number of Students Relative frequency (%)
Business 7 28.0
Economics 5 20.0
Finance 6 24.0
Humanities 2 8.0
Others 5 20.0
Total 25 100.0
Table 2.4 Relative Frequency - Type of Major
Solution:
The completed pie chart is shown in Figure 2.2.
[Figure 2.2: pie chart with slices Business 28%, Economics 20%, Finance 24%, Humanities 8%, Others 20%]
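The relative frequencies in Table 2.4 can be reproduced with a few lines of Python. This is only a sketch: the counts are those of Table 2.3, and the variable names are illustrative.

```python
# Frequencies from Table 2.3.
frequency = {"Business": 7, "Economics": 5, "Finance": 6, "Humanities": 2, "Others": 5}

total = sum(frequency.values())

# Relative frequency (%) = frequency / total * 100, as in Table 2.4.
relative_frequency = {major: count / total * 100 for major, count in frequency.items()}

print(f"{'Type of Major':<15}{'Students':>10}{'Relative frequency (%)':>25}")
for major, count in frequency.items():
    print(f"{major:<15}{count:>10}{relative_frequency[major]:>25.1f}")
print(f"{'Total':<15}{total:>10}{sum(relative_frequency.values()):>25.1f}")
```

These percentages are the slice sizes of the pie chart.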
A frequency distribution groups the values of a variable into classes and lists the number of
values (frequencies) falling into each class. Note that the classes always represent the
variable. The classes are non-overlapping; that is, each value belongs to one and only one
class. The frequency distribution is sometimes presented together with the relative
frequency, cumulative frequency or cumulative relative frequency.
Example 2.4
The following data show the values (in $) of 30 transactions at a local cafe.
64 42 83 24 12 15
67 51 77 57 81 19
62 46 35 27 69 41
64 25 48 64 72 48
50 34 75 38 51 26
Solution:
Here are the guidelines to set up a frequency distribution:
Step 1: Decide on the number of classes, k
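We use the rule $2^k > n$, where n represents the number of observations, and choose the
smallest k that satisfies it. Here n = 30; since $2^4 = 16$ is less than 30 and $2^5 = 32$ is
greater than 30, we take k = 5. In practice, the number of classes can be a subjective choice.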
Step 2: Determine the class interval
An equal class interval (i), also called the class width (w), is preferred. A class interval is
defined as the difference between the lower limits of two consecutive classes. To compute the
class interval, we can use
$$ i \ge \frac{H - L}{k} $$
where H = Highest value in the dataset
L = Lowest value in the dataset
k = Number of classes (obtained in Step 1)
The highest value is 83 and the lowest value is 12 for this data set. So, we have
$$ i \ge \frac{83 - 12}{5} = 14.2 $$
We round up to a convenient whole number and therefore decide on a class interval of i = 15.
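The two calculations so far can be checked with a short Python sketch using the 30 cafe transactions from Example 2.4. The variable names are illustrative; the code reports the smallest whole-number interval that satisfies the rule.

```python
import math

transactions = [64, 42, 83, 24, 12, 15,
                67, 51, 77, 57, 81, 19,
                62, 46, 35, 27, 69, 41,
                64, 25, 48, 64, 72, 48,
                50, 34, 75, 38, 51, 26]

n = len(transactions)

# Step 1: smallest k such that 2**k > n.
k = 1
while 2 ** k <= n:
    k += 1

# Class interval rule: i >= (H - L) / k, then round up to a whole number.
highest, lowest = max(transactions), min(transactions)
minimum_interval = (highest - lowest) / k
whole_number_interval = math.ceil(minimum_interval)

print(f"n = {n}, k = {k}, H = {highest}, L = {lowest}")
print(f"(H - L) / k = {minimum_interval}, rounded up to {whole_number_interval}")
```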
Step 3: Set up the individual classes and count (tally) the number of observations for each
class.
The completed table is presented below together with additional information, namely the
relative frequency and cumulative frequency columns. The relative frequency shows the
proportion (percentage) of observations falling within each class. The cumulative
frequency is a running total of the frequency counts. (See Table 2.5)
Note that when we have a class say, “10 up to 25”, it would include all values from 10 to less
than 25. In other words, a transaction value of $25 would be included in the class “25 up to
40”. Similarly, a transaction value of $40 would be included in the class “40 up to 55” and so
on.
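The tallying rule just described (a class “a up to b” contains values from a to less than b) can be sketched in Python. The first lower limit of 10, the interval of 15 and the five classes are taken from the text; the printed counts are simply what this code computes from the 30 transactions and are offered as a check, not as a copy of Table 2.5.

```python
from itertools import accumulate

transactions = [64, 42, 83, 24, 12, 15,
                67, 51, 77, 57, 81, 19,
                62, 46, 35, 27, 69, 41,
                64, 25, 48, 64, 72, 48,
                50, 34, 75, 38, 51, 26]

lower_limit, interval, k = 10, 15, 5   # first class "10 up to 25", i = 15, five classes

# Class "a up to b" contains every value v with a <= v < b.
classes = [(lower_limit + j * interval, lower_limit + (j + 1) * interval) for j in range(k)]
frequencies = [sum(1 for v in transactions if lo <= v < hi) for lo, hi in classes]

n = len(transactions)
relative = [f / n * 100 for f in frequencies]   # relative frequency (%)
cumulative = list(accumulate(frequencies))      # running total of the frequencies

for (lo, hi), f, r, c in zip(classes, frequencies, relative, cumulative):
    print(f"{lo} up to {hi}: frequency {f}, relative {r:.1f}%, cumulative {c}")
```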
For large data sets, manual tallying naturally becomes tedious. Hence, the use of software
packages such as SPSS or Excel greatly assists in setting up frequency distributions or tables.
3.2 Histogram
A histogram is a chart that can be drawn for a frequency distribution or relative frequency
distribution. We mark classes on the horizontal axis and frequencies (or relative frequencies)
on the vertical axis. The columns are drawn adjacent to each other without leaving any gap
between them.
Example 2.5
Using the information from the frequency distribution in Table 2.5, construct a histogram.
Solution:
The completed histogram is shown in Figure 2.3.
Note that we may also use class mid-points instead of class limits to label the horizontal axis.
A class mid-point is calculated by summing the lower and upper limit of a class and then
dividing by 2. Thus,
\[ \text{Class mid-point} = \frac{\text{Lower class limit} + \text{Upper class limit}}{2} \]
A frequency polygon is a graphical display with lines connecting the intersection points of the
class mid-point and frequencies (or relative frequencies).
Example 2.6
The table below shows a frequency distribution of the ages (years) of all 50 employees of a
company. Compute the class mid-points. Thereafter, construct a frequency polygon.
Class          Frequency    Mid-point
20 up to 32    12           (20 + 32)/2 = 26
32 up to 44    17           (32 + 44)/2 = 38
44 up to 56    14           (44 + 56)/2 = 50
56 up to 68    7            (56 + 68)/2 = 62
Total          50
Table 2.6 Computing Class mid-points
Solution:
The completed frequency polygon is shown in Figure 2.4
[Figure 2.4: Frequency polygon. Horizontal axis: Age (years), marked at mid-points 14, 26, 38, 50, 62 and 74. Vertical axis: No of Employees, from 0 to 18.]
Note in Figure 2.4 that, to complete the frequency polygon, mid-points of 14 and 74 are added
to the X-axis to "anchor" the polygon at zero frequencies. These two values, 14 and 74, were
derived by subtracting the class interval of 12 years from the lowest mid-point (26 years) and
by adding 12 years to the highest mid-point (62 years) in the frequency distribution.
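As a small sketch of the calculation behind Table 2.6 and the two anchor points, the mid-points and polygon coordinates could be computed in Python as below. The variable names are illustrative only.

```python
classes = [(20, 32), (32, 44), (44, 56), (56, 68)]
frequencies = [12, 17, 14, 7]

# Mid-point = (lower limit + upper limit) / 2
midpoints = [(lower + upper) / 2 for lower, upper in classes]
width = classes[0][1] - classes[0][0]  # class interval of 12 years

# Add one anchor point on each side, both with zero frequency.
polygon_x = [midpoints[0] - width] + midpoints + [midpoints[-1] + width]
polygon_y = [0] + frequencies + [0]

print(list(zip(polygon_x, polygon_y)))
```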
Both the histogram and the frequency polygon allow us to get a quick picture of the main
characteristics of the data (highs, lows and concentration of data etc.).
When plotted on a diagram, the cumulative frequencies give a curve that is called an ogive
(pronounced o-jive).
Example 2.7
Using the data from Table 2.6, compute the cumulative frequencies. Thereafter, construct a
cumulative frequency polygon.
Solution:
Class          Frequency    Cumulative frequency
20 up to 32    12           12
32 up to 44    17           12 + 17 = 29
44 up to 56    14           29 + 14 = 43
56 up to 68    7            43 + 7 = 50
Total          50
Table 2.7 Computing Cumulative Frequencies
To draw the ogive, the variable age is marked on the horizontal axis using the lower class limits
and the cumulative frequencies on the vertical axis. The dots are then marked above these limits
to correspond to the cumulative frequencies. (See Figure 2.5)
Note that the cumulative frequency polygon starts at the lower limit of the first class and ends
at the upper limit of the last class.
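The running total in Table 2.7 can be reproduced with a short Python sketch. The list of ogive points assumes the usual convention that each cumulative frequency is plotted at the upper limit of its class (which is also the lower limit of the next class), starting from zero at the lower limit of the first class.

```python
from itertools import accumulate

frequencies = [12, 17, 14, 7]          # from Table 2.6
upper_limits = [32, 44, 56, 68]
first_lower_limit = 20

total = sum(frequencies)
cumulative = list(accumulate(frequencies))      # running total
relative = [f / total for f in frequencies]     # proportion in each class

# Points for the ogive: start at zero, end at the total number of observations.
ogive = [(first_lower_limit, 0)] + list(zip(upper_limits, cumulative))
print(cumulative)
print(ogive)
```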
Another technique to present quantitative data is the stem and leaf diagram. An advantage of
the stem and leaf diagram is that we are able to view the data distribution featuring the actual
numerical values of the raw data.
The data are arranged in ascending order. Each numerical value is an observation divided into
two parts. The leading digit(s) is the stem and the remaining digit(s) forms the leaf. The
arrangement of leaves on the stems provides a pictorial representation of the distribution.
Example 2.8
The following data shows the dividend yield (in percent) of 12 blue chip stocks. Create a stem-
and-leaf diagram for this set of data.
4.5 3.7 4.4 3.8 7.7 3.8 3.5 3.4 3.0 4.6 3.9 2.3
Solution:
The first digit would form the stem and the second digit the leaf.
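A minimal Python sketch of this split into stems and leaves is given below, using the twelve dividend yields from Example 2.8. The function name is hypothetical, and printing the empty stems 5 and 6 is a presentation choice rather than a requirement.

```python
from collections import defaultdict

yields = [4.5, 3.7, 4.4, 3.8, 7.7, 3.8, 3.5, 3.4, 3.0, 4.6, 3.9, 2.3]

def stem_and_leaf(values):
    """Split each one-decimal value into a stem (units digit) and a leaf (first decimal)."""
    stems = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(round(v * 10), 10)
        stems[stem].append(leaf)
    return dict(stems)

diagram = stem_and_leaf(yields)
for stem in range(min(diagram), max(diagram) + 1):
    leaves = " ".join(str(leaf) for leaf in diagram.get(stem, []))
    print(f"{stem} | {leaves}")
```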
Sometimes, we may want to construct a stem and leaf diagram for three- and four-digit
numbers.
Example 2.9
The following data gives the monthly rents ($) paid by a sample of 30 households from a certain
city. Create a stem and leaf diagram.
429 540 550 578 585 620 650 660 675 732
750 750 765 780 800 820 840 870 871 880
900 930 950 956 975 989 1020 1020 1030 1070
Solution:
The stem and leaf diagram would appear as follows:
Stem Leaf
4 29
5 40 50 78 85
6 20 50 60 75
7 32 50 50 65 80
8 00 20 40 70 71 80
9 00 30 50 56 75 89
10 20 20 30 70
Stem unit = 100 Leaf unit = 1
Sometimes we may have bivariate data, that is, data for two variables where we want to
compare and find relationships. For 2 quantitative variables, a scatter diagram can be drawn.
If we have 2 qualitative (categorical) variables, we can summarise the data using a contingency
table.
The Scatter Diagram is the simplest method to study the relationship between two variables
where the values for each pair of variables are plotted together in the form of dots. The degree
to which the variables are related to each other depends on the manner in which the dots are
scattered over the chart. The more the dots plotted are scattered over the chart, the lesser the
degree of correlation between the variables.
Examples of scatter diagrams are shown in Figure 2.8. We can see that sales level and
advertising expenditure are linked. Sales level is also linked to the number of employees;
however, the relationship is weaker as the points are more scattered. In both charts, it is obvious
that an increase in the horizontal axis variable (either advertising expenditure or number of
employees) resulted in an increase in the vertical axis variable (sales).
We will deal more with scatter diagrams in Session 7 (Linear Regression & Correlation)
[Figure 2.8: Scatter diagrams of Sales ($) against advertising expenditure and Sales ($) against number of employees.]
Contingency tables (also known as cross-tabulations or two-way tables) are used in statistics to
summarize the relationship between two categorical variables. A contingency table is like a
special type of frequency table, where two variables are shown simultaneously.
Table 2.8 shows a sample of 100 persons from 3 different age groups and their preferred
activity.
                      Age Group
Preferred activity    < 25 years    25 to 50 years    Over 50 years    Total
Brisk walking         0             14                15               29
Swimming              20            11                10               41
Tennis                20            5                 5                30
Total                 40            30                30               100
Table 2.8 Contingency Table
Is preferred activity related to age group? You would probably be able to see some relationship
by examining the frequency counts from the table.
We will deal more with contingency tables in Session 11 (Analysis of Categorical Data: Chi-
Square Test of Independence).
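One informal way to examine the relationship in Table 2.8 is to convert the counts into proportions within each age group. The Python sketch below does this; it is a descriptive check only, not the Chi-Square test, and the dictionary layout is an illustrative choice.

```python
age_groups = ["< 25 years", "25 to 50 years", "Over 50 years"]
table = {
    "Brisk walking": [0, 14, 15],
    "Swimming": [20, 11, 10],
    "Tennis": [20, 5, 5],
}

# Column totals: number of persons sampled in each age group.
column_totals = [sum(row[j] for row in table.values()) for j in range(len(age_groups))]

# Share of each activity within each age group.
shares = {
    activity: [count / column_totals[j] for j, count in enumerate(row)]
    for activity, row in table.items()
}

for activity, row in shares.items():
    print(activity, [f"{p:.0%}" for p in row])
```

In this sample, nobody under 25 prefers brisk walking, while half of those over 50 do, which suggests that preferred activity varies with age group.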
5. Discussion questions
1. D’drink Inc asks 100 randomly sampled customers to take a taste test and select the
beverage they preferred most. The results are shown in the following table:
Beverage Number
Cola-plus 40
Coca-cola 25
Pepsi 20
Lemon-lime 15
TOTAL 100
75 54 62 79 79 53 67 60 60 105
58 51 69 65 90 98 82 93 60 93
74 77 42 84 88 69 74 73 64 114
Construct a frequency distribution. Use the value 40 as the lower limit of the first class.
Also, compute the relative frequencies.
3. The annual imports of a selected group of electronic suppliers are shown in the
following frequency distribution.
(c) On the basis of the cumulative frequency polygon, how many employees earn
$11 an hour or less? Half of the employees earn an hourly wage of how much
or more? Four employees earn how much or less?
5. The rate of return (%) of 21 unit trusts during a boom year are shown below :
8.3 9.6 9.5 9.1 8.8 11.2 7.7 10.1 9.9 10.8
10.2 8.0 8.4 8.1 11.6 9.6 8.8 8.0 10.4 9.8 9.2
6. A survey was carried out on readers of a travel magazine. Respondents were asked
the amount they spent on holidays in the previous year. The frequency distribution is
shown below:
Amount Spent ($) Frequency
3000 up to 6000 22
6000 up to 9000 53
9000 up to 12000 19
12000 up to 15000 1
15000 up to 18000 3
7. A megamall is considering revising its parking fees because of extremely heavy traffic
on weekends. A study was carried out to record the duration of stay (in minutes) of
500 cars entering the car park on a Saturday.
6. Supplementary questions
1. A Food Company has been serving a large range of breakfast cereals with an additional
flavoring, Koko Krunch, which has gained popularity among its customers. The
company is interested in finding out the customer preferences for Koko Krunch versus
Trix, Honey Stars, Cookie Crisp, and Snow Flakes. 360 customers were surveyed at
random by getting them to take a test and select the breakfast cereal they preferred the
most. The results are as shown below:
Breakfast Cereal Number
Koko Krunch 115
Trix 63
Honey Stars 102
Cookie Crisp 43
Snow Flakes 37
Total 360
(a) Is the data presented above qualitative or quantitative? Explain your answer.
(b) What do you call this table? What information does it present?
(c) Create a bar chart to illustrate the information above.
(d) Create a pie chart using the information above.
2. The table below shows the frequency distribution for the profit earned on furniture sold
in December 20x8 at the Orange Lights Furnishing Pte Ltd.
Profit ($) Frequency Relative
Frequency (%)
100 up to 300 18
300 up to 500 38
500 up to 700 57
700 up to 900 43
900 up to 1,100 33
1,100 up to 1,300 11
Total 200
3. A company sold the following number of units of laptops in the last 14 months.
33 38 49 27 38 29 38
38 38 48 39 50 29 31
4. The following data give the prices ($) for a sample of 20 books on investment tips.
32.5 24.0 37.0 67.9
43.7 49.0 62.7 15.8
19.5 59.5 31.7 54.7
29.7 44.5 22.7 27.4
53.4 17.4 47.5 43.2
(a) Construct a frequency distribution table for the above data using “10.0 up to
20” as the first class.
(b) Calculate the relative frequencies.
5. In CKS store, a customer service counter was set up to cater to enquiries and requests
from customers. The frequency distribution below shows the waiting time of customers
on a particular day.
Waiting Time (in minutes) No of customers
0 up to 5 4
5 up to 10 3
10 up to 15 6
15 up to 20 7
20 up to 25 4
25 up to 30 3
6. The following frequency distribution shows the order lead time (time elapsed between
when an order is placed and when it is filled) at TwinkleBell.com, an online blogshop
retailer.
7. The following are the responses of 20 students of an accounting class who were asked
to evaluate the usefulness of the course. The students were asked to choose one of five
responses: Excellent (E), Above average (AA), Average (A), Below average (B), and
Poor (P).
AA B A E AA
E AA B AA E
E A B AA E
A E P P B
8. Construct a scatter diagram for the following sample data. Describe the relationship
between the following X and Y values.
X 10 15 11 17 11 9 12 10 11 13
Y 5 6 7 8 4 3 6 2 2 7
                     Gender
Ice-cream Ordered    Women    Men    Total
Yes                  39       19     58
No                   11       31     42
Total                50       50     100
10. AA Fashion is considering a merger with BB Fashion. The board of directors surveyed
100 stockholders regarding their position on the merger. The following table below
presents the results of the survey.
                         Opinion
Number of Shares Held    Favor    Oppose    Undecided    Total
Under 100                16       9         3            28
100 up to 500            21       9         1            31
500 up to 1,000          13       11        1            25
Above 1,000              9        5         2            16
Total                    59       34        7            100
(a) State the level of measurement for the two variables used in the table above.
(b) Name the table presented above.
(c) State the group that is most in favour of the merger.
BUSINESS STATISTICS
SESSION 3
1. Introduction
In Session 2, we discussed how to organize data and present them using various charts. These
techniques are insufficient when we need to describe the main characteristics of a data set.
Numerical summary measures that provide the centre and spread of a distribution will show us
the main features of data sets. These measures are known as measures of central tendency and
measures of dispersion.
Data in its original form are known as raw or ungrouped data. The summary measures that
give averages are called measures of central tendency (also known as measures of location).
We shall look at three such measures – the mean, the median and the mode.
2.1 Mean
The mean, also called the arithmetic mean, is the most frequently used measure of central
tendency. For ungrouped data, the mean is obtained by dividing the sum of all values by the
number of values in the data set. The mean calculated for sample data is denoted by 𝑥̅ (read
as x-bar) and the mean calculated for population data is denoted by µ (a Greek character read
as mu).
Population Mean : \[ \mu = \frac{\sum x}{N} \]
Sample Mean : \[ \bar{x} = \frac{\sum x}{n} \]
where ∑ 𝑥 is the sum of all values of x, N is the population size, n is the sample size.
Example 3.1
The monthly salaries ($) of a sample of five employees for an IT company are as follows.
3700 4500 5300 7250 8250
Find the mean monthly salary.
Solution:
The mean will be denoted by 𝑥̅ since the data comes from a sample.
\[ \bar{x} = \frac{\sum x}{n} = \frac{3700 + 4500 + 5300 + 7250 + 8250}{5} = \$5{,}800 \]
Sometimes a data set may contain a very large or a very small value. Such values are called
outliers or extreme values. A major disadvantage of the mean is that it is very sensitive to
outliers. Example 3.2 illustrates this.
Example 3.2
Further to Example 3.1, suppose an additional employee was sampled making a sample size of
6. The salaries of these six employees are:
3700 4500 5300 7250 8250 25000
Compute the new mean value.
Solution:
The new mean will be
\[ \bar{x} = \frac{3700 + 4500 + 5300 + 7250 + 8250 + 25000}{6} = \$9{,}000 \]
The impact of the outlier is that it causes the mean to increase substantially from $5,800 to
$9,000. We should remember that the mean is not always the best measure of central tendency
because it is heavily influenced by outliers.
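The effect of the outlier in Examples 3.1 and 3.2 can be checked with a few lines of Python, using the `mean` function from the standard `statistics` module. The variable names are illustrative.

```python
from statistics import mean

salaries = [3700, 4500, 5300, 7250, 8250]   # Example 3.1
with_outlier = salaries + [25000]           # Example 3.2

mean_before = mean(salaries)
mean_after = mean(with_outlier)

print(mean_before)               # mean of the five salaries
print(mean_after)                # mean after adding the outlier
print(mean_after - mean_before)  # increase caused by a single extreme value
```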
2.2 Median
The median is the value of the middle term in a data set with n values after arranging the values
in ascending order.
Example 3.3
Given the age (years) of 5 college students : 21, 25, 19, 20, 22
Find the median.
Solution:
First, arrange the 5 values in ascending order: 19 20 21 22 25
\[ \text{Median position} = \frac{5 + 1}{2} = 3\text{rd value} \]
Hence, the median is 21 years.
Example 3.4
Given the weight (kilograms) of 4 basketball players: 76 73 80 75
Find the median.
Solution:
First, arrange the 4 values in ascending order : 73 75 76 80
\[ \text{Median position} = \frac{4 + 1}{2} = 2.5\text{th value} \]
This means that the median lies between the 2nd and the 3rd value. The median is found by
taking the average of the two numbers.
\[ \text{Hence, the median} = \frac{75 + 76}{2} = 75.5 \text{ kilograms} \]
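The position rule used in Examples 3.3 and 3.4, where the median sits at position (n + 1)/2 of the sorted data, can be sketched in Python as follows. The function name is hypothetical.

```python
import math

def median_by_position(values):
    """Return the median position (n + 1) / 2 and the median itself."""
    ordered = sorted(values)
    position = (len(ordered) + 1) / 2
    # For a whole-number position both indexes coincide; otherwise
    # the median is the average of the two neighbouring values.
    low = ordered[math.floor(position) - 1]
    high = ordered[math.ceil(position) - 1]
    return position, (low + high) / 2

print(median_by_position([21, 25, 19, 20, 22]))  # Example 3.3
print(median_by_position([76, 73, 80, 75]))      # Example 3.4
```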
2.3 Mode
Example 3.5
Find the modal weekly earnings for the following samples:
$ 144 185 163 144 185 196 → 144 and 185 (bimodal)
The main shortcoming of the mode is that a data set may not have a mode or may have
more than one mode. A dataset with one mode is said to be unimodal. If there are two modes,
it is bimodal. If it contains more than two modes then it is multimodal.
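A small Python sketch for finding the mode(s) and labelling the data set is shown below. The function name is hypothetical, and it assumes the convention that a data set in which every value occurs exactly once has no mode.

```python
from collections import Counter

def describe_modes(values):
    """Return the list of modes and a label for the data set."""
    counts = Counter(values)
    highest = max(counts.values())
    if highest == 1:
        return [], "no mode"   # assumed convention: all values occur once
    modes = sorted(v for v, c in counts.items() if c == highest)
    label = {1: "unimodal", 2: "bimodal"}.get(len(modes), "multimodal")
    return modes, label

print(describe_modes([144, 185, 163, 144, 185, 196]))  # sample from Example 3.5
```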
Looking at the three measures of central tendency, we cannot conclude which is definitely a
better overall measure. Each of them may be better under different situations. Nevertheless,
the most commonly used measure is the mean followed by the median. The median is a better
measure when the data set has outliers.
Two of the many shapes of a distribution are the symmetric and the skewed distributions. In
Session 2 we learnt about histograms and frequency polygons. We shall now look at the values
of the mean, median and mode for different shapes of distribution.
3.1 Symmetric Distribution (Mean=Median=Mode)
If we have a symmetric histogram and frequency polygon with a single peak, the mean, median
and mode will be equal and they lie at the centre of the distribution. (See Figure 3.2)
If we have a histogram and a frequency polygon that is skewed to the right (See Figure 3.3),
the mean is the largest and the mode is the smallest. The median will lie between the mean and
the mode. The mean is the largest because it is affected by large-value outliers, which pull up
the value of the mean.
If we have a histogram and a frequency polygon that is skewed to the left (See Figure 3.4), the
mean is the smallest and the mode is the largest. The median will again lie between the mean
and the mode. The mean is the smallest because it is affected by some small-value outliers,
which pull down the value of the mean.
Figure 3.4 Negatively Skewed Distribution
Measures of central tendency give us an idea of the typical middle value in a dataset. However,
these measures do not reveal the full picture of the distribution. Two distributions may have
similar means but the spread (also known as dispersion or variability) of data could be
completely different.
Listed below are the test scores for two groups of students.
Class A: 55, 56, 57, 58, 59, 60, 60, 60, 61, 62, 63, 64, 65
Class B: 35, 40, 45, 50, 55, 60, 60, 60, 65, 70, 75, 80, 85
We can see that the two data sets have equal mean, median, and mode. However, the scores
for Class B are much more dispersed compared to Class A. Hence, we need some measures
that help us know about the spread of data. These measures are called Measures of Dispersion.
We shall look at just three (Range, Variance and Standard Deviation) amongst a number of
measures of dispersion.
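The claim about Class A and Class B can be verified with Python's standard `statistics` module. The sketch below prints the three measures of central tendency and, as a first look at spread, the difference between the largest and smallest score.

```python
from statistics import mean, median, mode

class_a = [55, 56, 57, 58, 59, 60, 60, 60, 61, 62, 63, 64, 65]
class_b = [35, 40, 45, 50, 55, 60, 60, 60, 65, 70, 75, 80, 85]

for name, scores in (("Class A", class_a), ("Class B", class_b)):
    spread = max(scores) - min(scores)
    print(name, mean(scores), median(scores), mode(scores), spread)
```

Both classes share the same centre, but the gap between the highest and lowest score is five times larger in Class B.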
4.1 Range
The range is obtained by taking the difference between the largest and the smallest values in a
data set.
Example 3.6
The closing month-end stock price ($) for DBS Bank over the last six months are as follows:
25.20 24.90 22.70 27.80 28.30 29.00
Find the range.
Solution:
Range = Largest value – Smallest value
= 29.00 – 22.70
= $6.30
The range, like the mean, has the disadvantage of being influenced by outliers. Hence, it may
not be a good measure of dispersion for datasets with outliers.
Another disadvantage is that the range uses only two values in the dataset regardless of the
dataset size. All other values are ignored.
The variance is defined as the average of the squared deviations of the data values from the
mean. The variance calculated for population data is denoted by σ² (read as sigma squared),
and the variance calculated for sample data is denoted by s².
Sample variance: \[ s^2 = \frac{\sum (x - \bar{x})^2}{n - 1} \]
The variance is expressed in squared units, for example, dollars², minutes² etc.,
which are not easy to interpret. For this reason, we obtain the standard deviation by taking
the square root of the variance.
Population standard deviation: \[ \sigma = \sqrt{\frac{\sum (x - \mu)^2}{N}} \]
Sample standard deviation: \[ s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}} \]
Standard deviation is the most frequently used measure of dispersion. It tells us how closely
the values of the data set are clustered around the mean. A small value standard deviation
indicates a smaller variability of the data values around the mean. The standard deviation is
expressed in the same unit as the original variable value.
Example 3.7
Given the following data set: 6 3 8 5 3
Compute the variance and standard deviation
(a) Assuming that the data are from a population.
(b) Assuming that the data are from a sample.
Solution (a):
Assuming population data….
\[ \mu = \frac{\sum x}{N} = \frac{6 + 3 + 8 + 5 + 3}{5} = 5 \]
\[ \sigma^2 = \frac{\sum (x - \mu)^2}{N} = \frac{(6-5)^2 + (3-5)^2 + (8-5)^2 + (5-5)^2 + (3-5)^2}{5} = 3.6 \]
\[ \sigma = \sqrt{3.6} = 1.897 \]
x      µ      (x – µ)²
6      5      1
3      5      4
8      5      9
5      5      0
3      5      4
Σx = 25       Σ(x – µ)² = 18
\[ \mu = \frac{\sum x}{N} = \frac{25}{5} = 5 \]
\[ \sigma^2 = \frac{\sum (x - \mu)^2}{N} = \frac{18}{5} = 3.6 \]
\[ \sigma = \sqrt{3.6} = 1.897 \]
Solution (b):
Assuming sample data….
\[ \bar{x} = \frac{\sum x}{n} = \frac{6 + 3 + 8 + 5 + 3}{5} = 5 \]
\[ s^2 = \frac{\sum (x - \bar{x})^2}{n - 1} = \frac{(6-5)^2 + (3-5)^2 + (8-5)^2 + (5-5)^2 + (3-5)^2}{5 - 1} = 4.5 \]
\[ s = \sqrt{4.5} = 2.121 \]
There are alternative computation formulas for calculating the standard deviation:
\[ \sigma = \sqrt{\frac{\sum x^2}{N} - \mu^2} \qquad s = \sqrt{\frac{\sum x^2 - n\bar{x}^2}{n - 1}} \]
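As a cross-check of Example 3.7, the following Python sketch computes the variance and standard deviation both from the definitions and from the alternative computation formulas. The variable names are illustrative.

```python
import math

data = [6, 3, 8, 5, 3]
n = len(data)
mean = sum(data) / n

# Definition: sum of squared deviations from the mean.
squared_deviations = sum((x - mean) ** 2 for x in data)
population_variance = squared_deviations / n
sample_variance = squared_deviations / (n - 1)

# Alternative computation formulas, based on the sum of squares.
sum_of_squares = sum(x * x for x in data)
population_sd_alt = math.sqrt(sum_of_squares / n - mean ** 2)
sample_sd_alt = math.sqrt((sum_of_squares - n * mean ** 2) / (n - 1))

print(population_variance, round(math.sqrt(population_variance), 3))
print(sample_variance, round(math.sqrt(sample_variance), 3))
print(round(population_sd_alt, 3), round(sample_sd_alt, 3))
```

Both routes give the same standard deviations, so the choice between them is a matter of convenience.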
5. Mean, Variance and Standard Deviation for Grouped Data (samples only)
When data are given in the form of a frequency distribution, we will not know the actual values
of the individual observations. Hence, we need to find an approximation for the mean of
grouped data. We will look at the mean for samples only.
\[ \bar{x} = \frac{\sum fM}{n} \]
where M = midpoint of a class and f = frequency of a class
Example 3.8
A company recently surveyed a sample of employees to determine how far they lived from their
corporate headquarters. The results are shown below:
Distance (kilometres) Number of Employees
0 up to 5 4
5 up to 10 15
10 up to 15 27
15 up to 20 18
20 up to 25 6
Compute the mean.
Solution:
For grouped data, we need to compute the mid-point of each class. The purpose of the mid-
point is that it is used as an estimate for all the values in that particular class.
\[ \bar{x} = \frac{\sum fM}{n} = \frac{4(2.5) + 15(7.5) + 27(12.5) + 18(17.5) + 6(22.5)}{70} = \frac{910}{70} = 13 \text{ kilometres} \]
The formulas used for calculating sample variance and standard deviation for grouped data are:
Sample variance \[ s^2 = \frac{\sum f(M - \bar{x})^2}{n - 1} \]
Sample standard deviation \[ s = \sqrt{\frac{\sum f(M - \bar{x})^2}{n - 1}} \]
Example 3.9
We refer back to the data in Example 3.8. Compute the sample standard deviation for the
grouped data.
Class Frequency (f)
0 up to 5 4
5 up to 10 15
10 up to 15 27
15 up to 20 18
20 up to 25 6
Total 70
Solution:
For grouped data, the mid-point for each class is first computed.
\[ s = \sqrt{\frac{\sum f(M - \bar{x})^2}{n - 1}} = \sqrt{\frac{1807.50}{70 - 1}} = 5.118 \text{ kilometres} \]
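The grouped-data calculations in Examples 3.8 and 3.9 can be reproduced with the Python sketch below, which uses each class mid-point M as the estimate for every value in that class. The variable names are illustrative.

```python
import math

classes = [(0, 5), (5, 10), (10, 15), (15, 20), (20, 25)]
frequencies = [4, 15, 27, 18, 6]

midpoints = [(lower + upper) / 2 for lower, upper in classes]
n = sum(frequencies)

# Grouped mean: sum of f * M divided by n.
grouped_mean = sum(f * m for f, m in zip(frequencies, midpoints)) / n

# Grouped sample standard deviation: sum of f * (M - mean)^2 divided by n - 1.
weighted_squares = sum(f * (m - grouped_mean) ** 2 for f, m in zip(frequencies, midpoints))
grouped_sd = math.sqrt(weighted_squares / (n - 1))

print(n, grouped_mean, weighted_squares, round(grouped_sd, 3))
```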
6. Empirical Rule
Example 3.10
A set of 200 observations has a mean (µ) of 100 and a standard deviation (σ) of 10. If the
distribution is symmetrical, approximately how many observations should be found in the
interval 80 to 120?
Solution:
The interval 80 to 120 lies within ± 2σ of the mean of 100, obtained as follows:
\[ \mu \pm 2\sigma = 100 \pm 2(10) = 80 \text{ to } 120 \]
According to the empirical rule, about 95% of observations fall within 2 standard deviations of
the mean. Hence, approximately 0.95 × 200 = 190 observations should be found in the interval.
Example 3.11
The evaluation score of all employees in a company follows a symmetric bell-shaped curve,
with a mean of 3.8 (out of total score of 5) and variance of 0.3. Harry’s score is 4.0. The
company recognises the top 16% of performing employees as the top talent. Is Harry a top
talent?
Solution:
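Applying the empirical rule, about 68% of scores lie within one standard deviation of the mean,
so about 16% of scores lie above µ + σ = 3.8 + √0.3 ≈ 3.8 + 0.55 = 4.35.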
Hence, employees with evaluation score of 4.35 and above would be considered top talent.
Since Harry’s score is 4.0 which is less than 4.35, Harry is not a top talent.
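The arithmetic in Examples 3.10 and 3.11 can be sketched in Python as follows. The percentages 68% and 95% are the usual empirical-rule approximations for a symmetric bell-shaped distribution, and the variable names are illustrative.

```python
import math

# Example 3.10: about 95% of observations lie within 2 standard deviations.
observations = 200
expected_in_interval = 0.95 * observations

# Example 3.11: about 68% lie within 1 standard deviation, so the
# remaining 32% is split equally between the two tails.
top_share = (1 - 0.68) / 2
mean_score = 3.8
sd_score = math.sqrt(0.3)       # the example gives the variance, not the SD
cutoff = mean_score + sd_score  # score separating the top 16%

harry_score = 4.0
harry_is_top_talent = harry_score >= cutoff

print(round(expected_in_interval), round(top_share, 2), round(cutoff, 2), harry_is_top_talent)
```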
7. Discussion questions
1. All the students in an advanced mathematics course form a population. Their course
grades are:
2. The number of work stoppages in the construction industry for selected months are:
3. Many regular customers of Jinny Group purchased hair packages of varying dollar
amounts. A sample of 8 customers showed the amount (in dollars) of hair packages
purchased by these customers.
4. A company has an office in Location A that hired five audit trainees. The monthly
starting salaries were:
5. The years of service for a sample of seven employees at a telco retail outlet are:
4, 2, 5, 4, 5, 2 and 6.
6. The net profits of a sample of large importers of household products were organized
into the following table:
Net Profits ($m) Number of importers
2 up to 6 1
6 up to 10 4
10 up to 14 10
14 up to 18 3
18 up to 22 2
8. Supplementary questions
2. The following data represent the weight in kilograms of several randomly selected
batches of dried goods arriving at a port last month.
Weight (kilograms) Number of batches
0 up to 25 5
25 up to 50 23
50 up to 75 8
75 up to 100 6
100 up to 125 4
4. The following table shows the sales turnover of Dinesh Sundries Store for the month of
July.
Daily Sales ($) Number of Days
0 up to 1000 1
1000 up to 2000 2
2000 up to 3000 3
3000 up to 4000 8
4000 up to 5000 12
5000 up to 6000 5
6. A variable has a unimodal distribution with mean 30 and median of 45. Is the
distribution skewed to the left, to the right or symmetric? Explain and give a rough
sketch of the distribution.
8. The following table shows the frequency distribution of the weights of a sample of 55
pieces of luggage being checked in on an airplane by passengers travelling in the premium
economy class.
Weight (kg) Frequency
5 up to 15 5
15 up to 25 7
25 up to 35 19
35 up to 45 17
45 up to 55 7
9. The following data shows the years of service of all the teachers in a private school.
10. Crystal, a secondary school student took 5 quizzes last week. The scores are listed
below:
8 98 24 18 27
BUSINESS STATISTICS
SESSION 4
_______________________________________________________________________
1. Introduction
Excel is a powerful application that offers a large number of functions, tools and options for
use. With knowledge of Excel, you can organize and manipulate data, perform computations
as well as create charts. This would allow you to conduct better data analysis and assist you in
decision making. As such, Excel has been widely used in many organisations today.
In this session, you will be introduced to several functions that would be useful for you to
analyse large data sets.
Here we cover some of the commonly used commands for descriptive statistics, for example:
o Sum
o Average (mean)
o Median
o Mode
o Count
o Minimum
o Maximum
o Sample standard deviation
o Frequency distribution
Example 4.1
Input the following data into cells A1 to A10 which show the ages of a sample of persons in a
tour group:
Based on the raw data, we can use some useful statistical functions in Excel to perform the
required tasks.
The answers are shown below:
Based on the raw data of sample ages in Example 4.1, we now wish to set up a frequency
distribution using 3 classes.
The lower and upper limits must be keyed into separate cells.
The raw data is at range A1:A10. We can now create the command to fill up the frequency
column from cells D14:D16.
Steps:
o Select the cell range D14:D16 (note: need to highlight the WHOLE range)
o The command syntax is as follows:
=FREQUENCY(data_array,bin_array)
where
data array refers to the raw data (A1:A10) and
bin array refers to the upper limits (C14:C16).
o Next, type the following command:
=frequency(A1:A10,C14:C16)
o To execute the command, press CTRL-SHIFT-ENTER
(note: if you simply hit ENTER key, an error will occur)
o You can sum up the frequency column using =SUM(D14:D16) and displaying the
answer at Cell D17.
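For readers who want to cross-check the spreadsheet result outside Excel, the counting rule of the FREQUENCY command can be sketched in Python. Each value is counted in the first bin whose upper limit is greater than or equal to it, and one extra slot collects values above the last limit. The function name, ages and upper limits below are hypothetical, since the Example 4.1 data are not reproduced here.

```python
from bisect import bisect_left

def excel_like_frequency(data, upper_limits):
    """Count values <= each upper limit (and above the previous one).

    The final element counts values greater than the last upper limit.
    """
    bins = sorted(upper_limits)
    counts = [0] * (len(bins) + 1)
    for value in data:
        counts[bisect_left(bins, value)] += 1
    return counts

ages = [23, 35, 41, 29, 52, 38, 47, 30, 26, 55]  # hypothetical sample
upper_limits = [30, 45, 60]                      # hypothetical bin array

frequency = excel_like_frequency(ages, upper_limits)
print(frequency, sum(frequency))
```

Because a value equal to an upper limit is counted in that bin, the upper limits have to be set slightly below the next lower limit if the "up to" convention for classes is to be preserved.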
The final output is shown below:
2.3. Histogram
Example 4.2
Create a histogram based on the frequency distribution below:
Hours spent studying No. of Respondents (frequency)
0 up to 10 3
10 up to 20 2
20 up to 30 5
To obtain a histogram, select the range of frequency counts and then select Insert -> Chart ->
Column. A column chart will be generated.
Double-click the bars and select Format Data Series to reduce the Gap Width to 0%. Add a
suitable title and axis labels to the histogram using the Chart Design/Tools function.
2.4 Pivot Tables and Charts
The Pivot Table feature of Excel is useful and versatile. This tool makes it possible to
summarise your raw data into more informative tabulations.
Example 4.3
The raw data show the favourite colour provided by 15 children. Input the values to Excel.
To create a PIVOT table, select the entire range (K1:K16) including the header.
Then select Insert, Pivot Table.
In the resulting Pivot Table framework drag “colour” item to the Row Labels area and then
Drag the “colour” item into the value area:
Next, Right click in the value area, to change the value “Field setting” to Count since we are
doing a count of the number of children in each category.
To obtain a pie chart, simply highlight the range A4:B7 then select Insert Chart/Pie.
The final pie chart is shown below:
PIE CHART
You may add title, data labels to the chart using Chart Design/tools and change accordingly.
Example 4.4
The pivot table command can also be used for two variables to form a cross-tabulation.
Suppose we now have the data for 2 variables – Gender and favourite Colour.
We want to do a cross tabulation of Gender (row variable) versus Colour (column variable).
The steps are :
o Select the entire range J1:K16 including the headers.
o Select Insert, Pivot Table
o Drag “GENDER” item to the Row Labels area.
o Drag “COLOUR” item to the Column Labels area.
o Drag either “GENDER” or “COLOUR” to the Values Area
o For the Values Area, change field setting to Count if required.
The final cross tabulation (also known as a contingency table) is shown below:
2.5 VLOOKUP (Vertical look up) Function
VLookup is an Excel function to look up and retrieve data from a specific column in a table.
Example 4.5
In this example, we wish to provide a comment (Good, Average and Unsatisfactory) depending
on the scores obtained by a group of students as shown in the table below:
We shall now use the vlookup function to insert the comments for the various scores obtained
by 10 persons.
Our command would be =VLOOKUP(A2, D2:G4, 4). The final answers are shown below:
EXCEL LAB EXERCISE 1
2. The 3 most popular type of burgers purchased by 15 customers over the last one hour
are as follows:
Fish Chicken Vegetable Vegetable Chicken
Chicken Fish Chicken Chicken Chicken
Fish Chicken Vegetable Fish Fish
3. The following data shows the gender and the brand of cars purchased by a sample of
20 new car buyers.
GENDER BRAND GENDER BRAND
Male Honda Male Toyota
Male Nissan Female Toyota
Male Toyota Male Nissan
Female Nissan Female Honda
Male Honda Male Toyota
Female Nissan Female Mazda
Female Mazda Male Toyota
Female Toyota Female Mazda
Male Nissan Male Honda
Male Toyota Male Mazda
Using GENDER as the row variable and BRAND as the column variable, create a
contingency table. [use Pivot Table function]
EXCEL LAB EXERCISE 1 (ANSWERS)
1.
2.
Row Labels     Count of ITEM
Chicken        7
Fish           5
Vegetable      3
Grand Total    15
[Column chart: count of customers by burger type (Chicken, Fish, Vegetable), vertical axis from 0 to 8.]
3.
Count of BRAND    Column Labels
Row Labels        Honda    Mazda    Nissan    Toyota    Grand Total
Female            1        3        2         2         8
Male              3        1        3         5         12
Grand Total       4        4        5         7         20
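The cross-tabulation above can also be cross-checked outside Excel. The Python sketch below counts the 20 (gender, brand) pairs from Question 3 with `collections.Counter`; the variable names are illustrative, and the pairs are transcribed from the exercise table.

```python
from collections import Counter

buyers = [
    ("Male", "Honda"), ("Male", "Nissan"), ("Male", "Toyota"), ("Female", "Nissan"),
    ("Male", "Honda"), ("Female", "Nissan"), ("Female", "Mazda"), ("Female", "Toyota"),
    ("Male", "Nissan"), ("Male", "Toyota"), ("Male", "Toyota"), ("Female", "Toyota"),
    ("Male", "Nissan"), ("Female", "Honda"), ("Male", "Toyota"), ("Female", "Mazda"),
    ("Male", "Toyota"), ("Female", "Mazda"), ("Male", "Honda"), ("Male", "Mazda"),
]

counts = Counter(buyers)
genders = sorted({gender for gender, _ in buyers})
brands = sorted({brand for _, brand in buyers})

# One row per gender, one column per brand.
crosstab = {g: [counts[(g, b)] for b in brands] for g in genders}

print("Row Labels", *brands, "Grand Total")
for g in genders:
    print(g, *crosstab[g], sum(crosstab[g]))
```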
4.
RATING DESCRIPTION
9 Good
8 Good
8 Good
7 Average
7 Average
6 Average
5 Average
7 Average
8 Good
3 Poor
EXCEL LAB EXERCISE 2
(with step- by- step guide)
Joe has requested that you carry out an analysis of the used car market using the following data:
2. Compute the minimum, maximum, mean, median and standard deviation for the resale price.
3. Create a frequency distribution table, relative frequency and cumulative frequency based on the
resale price. You are required to group the data into appropriate classes using class interval of
$22,000. Use $40,000 as lower limit of the first class.
4. Create a pie chart to show the proportion of vehicles under the different categories.
5. Create a histogram for resale price using the classes obtained in part (3).
6. Create a scatter plot with the horizontal axis as Category variable and the vertical axis as the
resale price variable.
7. Create a contingency table using Category as the row variable and No of Owners as the column
variable.
8. Create a Pie Chart for variable Age with 3 categories namely “NEW” (for cars that are 1 to 3
years old), “AVERAGE” (for cars that are between 4 to 6 years old) and “OLD” (for cars that are
between 7 to 10 years old). [Use VLOOKUP command]
PART 1: CREATING THE DATA
Assuming that your data for Resale Price is in Cells D2 to D21, you can click on the fx icon to find the
values for the Minimum, Maximum, Mean, Median and Standard Deviation.
Minimum
=MIN(D2:D21)
Maximum
=MAX(D2:D21)
Mean
=AVERAGE(D2:D21)
Median
=MEDIAN(D2:D21)
Standard Deviation
Sample Standard Deviation:
=STDEV.S(D2:D21) OR
=A34+ 22000
Classes
Go to Cell A34 and enter 40000 as your lower class limit.
For the upper class limit you can use (Lower class limit + Interval value) at cell C34 as follows:
=A34+21999.99
Complete the rest of the table (you may use COPY command) and format all numbers to WHOLE
numbers.
Step 2 : Creating a Frequency Column
Important: After selecting the data array and bin array for the frequency command, press
CTRL-SHIFT-ENTER to execute the command. (all frequency counts will automatically appear in the
relevant cells)
Add in the Total Frequency, Relative Frequency, Cumulative Frequency and Mid points
o Create the Total Frequency at Cell D39 by using the following formula:
Example: =SUM(D34:D38)
o You can compute the Relative Frequency by dividing each frequency count by the frequency
total.
Example: Enter the following at Cell E34
=D34/$D$39
o Cumulative Frequency can be computed by adding each frequency count to the previous
cumulative frequency.
At Cell    Enter command
F34        =D34
F35        =+F34+D35
F36        =+F35+D36
and so on
o Mid points can be computed using (Lower Limit + Upper Limit)/2 e.g (A34+C34)/2.
PART 4: Draw a pie chart to show the proportion of vehicles under the different
categories.
First, click and drag the Category in the box under the Field name and place it in the Rows area below.
After that, click and drag the Category in the box under the Field name again and place it in the
Values area below.
A frequency table will be generated. Amend the row labels to Category 1, Category 2 and Category 3.
To create a pie chart, select data by highlighting the range A5:B7 then
Select Insert Chart/Pie and choose 3-D Pie tab to get the following chart.
Adding Percentages to the Pie Chart
You can add the percentages to the Pie Chart by placing cursor on your pie chart and right click. Select
Add Data Label. Next right click again and select Format Data Label and check the Percentage option.
Select the frequency column data at D34:D38, then click Insert chart and select Column chart.
Select any column and Right-click. Select “Format data series”/Options and change gap width to 0.
To change x-axis labeling to class midpoints, Click on the Chart, Select Data and enter G34:G38 for
Category (X) axis labels as shown below:
PART 6: Create a scatter plot using Category variable on the horizontal axis and Resale Price
variable on the vertical axis
Highlight the range of two columns of data where you would want to plot one variable against another.
(Note: DO NOT select the variable names). Click Insert/Chart/Scatter to get the following chart.
You can include axis titles by selecting the Chart Design, Add Chart Elements on the menu as follows:
PART 7: Create a Contingency Table
Select the data in the Category and No of Owners (including the headers). Note : It is alright for the
range to include more than 2 columns e.g. A1:F21. Select Insert Pivot Table to obtain the following:
Click OK and then proceed to use Pivot Table function to generate a contingency table with Category
for the Row and No of Owners for the Column.
Click and drag the field name Category to the Rows area box below.
After that, click and drag the field name No of Owners and place it in the Columns area box.
Next, drag Category (you can use No of Owners too) into the Values box and amend the setting to
Count of Category. See chart below.
Change the row labels to Category 1, Category 2 and Category 3. The final output will appear as
follows:
PART 8: Using VLOOKUP command
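Go to Cell G1 to create a new variable called AgeGroup. Now, use the following command at Cell G2:
=VLOOKUP(F2,$J$4:$M$6,4)
Here $J$4:$M$6 is the frequency distribution range and 4 is the column of that range whose value is returned.
The command checks the Age value in F2 against the lower class value of the classes
from $J$4 to $M$6 and assigns the respective Classification Value in Column 4 to G2.
Because the fourth argument of VLOOKUP is omitted, an approximate match is used, so the lower class values must be sorted in ascending order.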
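The approximate-match lookup can be illustrated with a short Python sketch. The lower class limits and labels below are hypothetical stand-ins for the first and fourth columns of the lookup range, and `lookup_age_group` is an illustrative name, not an Excel function.

```python
from bisect import bisect_right

# Hypothetical lookup table: lower class limit of each age class and its label.
lower_limits = [0, 5, 10]          # must be sorted in ascending order
labels = ["Group 1", "Group 2", "Group 3"]

def lookup_age_group(age):
    """Return the label of the last class whose lower limit is <= age."""
    position = bisect_right(lower_limits, age) - 1
    if position < 0:
        # Excel would show #N/A here; this sketch raises an error instead.
        raise ValueError("age is below the first lower class limit")
    return labels[position]

for age in [0, 4.9, 5, 12]:
    print(age, lookup_age_group(age))
```

An age equal to a lower class limit is assigned to the class that starts at that limit, which is the behaviour the frequency table needs.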
Use the PIVOT table command to generate a PIE chart (with 3 segments) for AgeGroup.
Please refer to the steps in Part 4 to create the PIE chart.
BUSINESS STATISTICS
SESSION 5
PROBABILITY
1. define basic terms used in probability, namely experiment, outcome/event and sample
space.
2. understand mutually exclusive events and independent events.
3. use Venn Diagrams for computing probabilities
4. understand and apply basic addition and multiplication rules and special addition and
multiplication rules in computing probabilities
5. understand and apply conditional probabilities
6. apply Bayes’ theorem and draw Tree Diagrams.
7. compute mean, variance, standard deviation of a Discrete Probability Distribution.
_________________________________________________________________
1. Introduction
A probability is a measure of the chance that an event will happen. Its value ranges from 0 to
1. If the probability of an event is 1, the event will surely happen. If the probability of an event
is 0, the event will never happen.
Probability is everywhere. In a weather forecast, we may ask “What is the chance of rain?” In
investment decisions, we may ask “What is the probability of earning at least 10% on this
investment?” In a volleyball match, a fan may ask “What is the chance of Team A winning
the match?”
If people had perfect information about the future as well as the present and the past, there
would be no need for decision makers to consider the concepts of probability. However, since
we cannot eliminate uncertainty from our lives, we need to recognize its presence and use
probability concepts in the process of making decisions.
2. Definitions of Probability
Classical Probability
The classical probability rule is applied to compute probabilities of events for an experiment
where all outcomes are equally likely.
Example 5.1
Find the probability of obtaining a Head and the probability of obtaining a Tail for one toss of
a coin.
Solution:
P(Head) = 1/Total number of outcomes = ½
P(Tail) = ½
Empirical Probability
Under the empirical definition, the probability of an event is the number of times the event
happens divided by the number of observations. To calculate such probabilities we either use
past data or generate new data by performing the experiment a large number of times. The relative
frequency of an event is used as an approximation for the probability of that event.
Example 5.2
Ten of the 500 randomly selected components produced at a certain factory are found to be
defective. What is the probability that the next component manufactured at this factory is
defective?
Solution:
We can list the frequency and relative frequency for this example:
Component        Frequency    Relative Frequency
Defective        10           10/500 = 0.02
Not defective    490          490/500 = 0.98
Total            500          1.00
From the relative frequency column, the probability that a component is defective is 0.02. This
is an approximate probability. However, as the experiment is repeated again and again, the
approximate probability of an event will approach the actual probability. This is called the Law
of Large Numbers.
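The idea of the Law of Large Numbers can be illustrated with a small simulation. The sketch below assumes, purely for illustration, that the true probability of a defective component is 0.02; `relative_frequency` is a hypothetical helper name and the fixed seed only makes the run repeatable.

```python
import random

def relative_frequency(true_probability, trials, seed=0):
    """Simulate `trials` components and return the share that are defective."""
    rng = random.Random(seed)
    defective = sum(rng.random() < true_probability for _ in range(trials))
    return defective / trials

# Empirical probability from Example 5.2: 10 defective out of 500 observed.
observed = 10 / 500
print(observed)

# Assumed true probability of 0.02; the estimate tends to settle near it
# as the number of simulated components grows.
for trials in [100, 10_000, 1_000_000]:
    print(trials, relative_frequency(0.02, trials))
```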
Subjective Probability
Subjective probability is based on whatever information is available. It is based on the
individual’s own judgment, experience, information and belief. A soccer player may assign a
high probability to the chance of the team winning a game whereas the coach may assign a low
probability to the same event.
Experiment
This refers to any procedure or process that yields a result or an observation.
Event
This is the collection of one or more outcomes of an experiment.
- Events are mutually exclusive if the occurrence of any one event means that none of the
other events can occur at the same time.
- Events are independent if the occurrence of one event does not affect the occurrence of
another.
- Events are collectively exhaustive if at least one of the events must occur when an
experiment is conducted.
Outcome
This is the particular result of an experiment, in other words, what actually happens.
Sample Space
This is the set of all possible outcomes for an experiment. The sample space is typically called
S. e.g. S = {1, 2, 3, 4, 5, 6} when we roll a die.
Example 5.3
A white die and a black die are each rolled once and the number of dots showing on each die is
observed. We then add up the total number of dots on both dice. State the experiment, outcome
of interest and the sample space.
Solution:
Experiment                          Each die is rolled once.
Outcome of Interest                 Sum of the number of dots that face up on both dice
Possible Outcomes (Sample Space)    S = {2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12}
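The sample space in Example 5.3 can be checked by listing every pair of faces. The Python sketch below is an illustration only and assumes two fair six-sided dice; it also shows that the eleven possible sums are not equally likely.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

faces = range(1, 7)
rolls = list(product(faces, repeat=2))   # 36 equally likely (white, black) pairs

# Sample space for the outcome of interest: the sum of the two dice.
sample_space = sorted({white + black for white, black in rolls})

# Classical probability of each sum = favourable pairs / total pairs.
counts = Counter(white + black for white, black in rolls)
probability = {total: Fraction(count, len(rolls)) for total, count in counts.items()}

print(sample_space)
print(probability[2], probability[7])
```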
Example 5.4
An experiment consists of drawing one marble from a container that contains a mixture of red,
blue and yellow marbles. State the experiment, outcome of interest and the sample space.
Provide an example of an event.
Solution:
Experiment                          Drawing one marble from a container containing red, blue and yellow marbles
Outcome of Interest                 Color drawn
Possible Outcomes (Sample Space)    S = {red, blue, yellow}
Event example                       Color drawn is not Blue
4. Venn Diagrams
Examples of Venn diagrams depicting various relationships between events are shown below:
The two events A and B are mutually exclusive as they have no common outcomes. Hence,
P(A and B) = 0.
Intersection
The intersection (shaded area in Figure 5.3) of the two events A and B is denoted by A ∩ B. This
means that both events A and B can occur concurrently.
Union
The union of two events A and B is denoted by (A ∪ B). This means either Event A or Event B
has occurred, that is, equivalent to either A or B or both. The events are not mutually exclusive,
i.e. P(A and B) ≠ 0.
Complement
The complement rule is used to determine the probability of an event occurring by subtracting
the probability of the event not occurring from 1. The complement of A is denoted by ~A.
Hence, we have
P(A) + P(~A) = 1 or P(~A) = 1 - P(A)
5.1 Addition Rule
If A and B are two events that are not mutually exclusive, then P(A or B) is given by the
following formula:
P(A or B) = P(A) + P(B) – P(A and B)
Example 5.5
Students in a school are taking swimming and tennis lessons in the following proportions:
(a) Find the probability that a student takes either swimming or tennis lessons.
(b) Draw the Venn diagram.
Solution(a):
P(S or T) = P(S) + P(T) - P(S ∩ T)
= 0.64 + 0.35 – 0.2
= 0.79
Solution(b):
If Event A and Event B are mutually exclusive, then P(A ∩ B) = 0 since the events cannot occur
together. In the formula
P(A or B) = P(A) + P(B) - P(A and B)
the last term equals 0, so we leave it out. Then, P(A or B) will be:
P(A or B) = P(A) + P(B)
Example 5.6
The following data were collected on SQA Airlines about their flights from Destination Y to
Destination Z over the last six months.
Solution:
$$P(A \text{ or } B) = P(A) + P(B) = \frac{200}{2000} + \frac{150}{2000} = 0.175$$
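Both forms of the addition rule can be written as one small function. The Python sketch below is an illustration only; `prob_a_or_b` is a hypothetical name, and the Example 5.6 inputs are the fractions 200/2000 and 150/2000 as read from the worked solution.

```python
def prob_a_or_b(p_a, p_b, p_a_and_b=0.0):
    """General addition rule; leave p_a_and_b at 0 for mutually exclusive events."""
    return p_a + p_b - p_a_and_b

# Example 5.5: swimming and tennis are not mutually exclusive.
p_swim_or_tennis = prob_a_or_b(0.64, 0.35, 0.2)

# Example 5.6: mutually exclusive events, so the joint term is left out.
p_example_5_6 = prob_a_or_b(200 / 2000, 150 / 2000)

print(round(p_swim_or_tennis, 2), round(p_example_5_6, 3))
```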
The probability that events A and B happen together is called the joint probability of A and B
and is written as P(A and B).
Two events are independent if the occurrence of one event does not affect the occurrence of the
other. The formula is written as:
P(A and B) = P(A) X P(B)
Example 5.7
You throw a coin twice. What is the probability of getting a Head for both throws?
Solution:
$$P(\text{1st H and 2nd H}) = \frac{1}{2} \times \frac{1}{2} = \frac{1}{4}$$
Example 5.8
Richard has recently purchased two stocks, Wilma and Gending. The probability that Wilma
stock will increase in value next year is 0.5 and the probability that Gending stock will increase
in value next year is 0.7. Assume that the two stocks are independent.
(a) Find the probability that both stocks increase in value next year.
(b) Find the probability that at least one of these stocks increases in value next year.
Solution(a):
Let W = Wilma increases in value and G = Gending increases in value
P(W and G) = P(W) X P(G) = 0.5 X 0.7 = 0.35
Solution(b):
P (at least one increases in value)
= 1 – P (both did not increase in value)
= 1 – [P (~W and ~G)]
= 1- (0.5 X 0.3)
= 0.85
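Example 5.8 can be checked with a few lines of Python. This is a sketch under the example's own assumption that the two stocks are independent; the variable names are illustrative.

```python
p_w = 0.5   # P(Wilma increases in value)
p_g = 0.7   # P(Gending increases in value)

# Special multiplication rule for independent events.
p_both = p_w * p_g

# Complement rule: at least one increases = 1 - neither increases.
p_neither = (1 - p_w) * (1 - p_g)
p_at_least_one = 1 - p_neither

# The addition rule gives the same answer by a different route.
p_at_least_one_check = p_w + p_g - p_both

print(round(p_both, 2), round(p_at_least_one, 2), round(p_at_least_one_check, 2))
```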
5.2.2 Dependent events
When events are not independent, the joint probability P(A and B) is given by the following
formula:
P(A and B) = P(A) X P(B|A)
Rearranging the multiplication rule for dependent events gives P(B|A) = P(A and B) / P(A). Hence,
if we know P(A and B) and P(A), we can calculate the conditional probability of B given A if required.
Example 5.9
There are five coloured balls in a box namely 3 yellow (Y) balls and 2 red (R) balls. Two balls
are drawn successively without replacement.
(a) What is the probability that both balls are yellow?
(b) What is the probability that the two balls are of different colours?
Solution(a):
$$P(Y_1 \text{ and } Y_2) = \frac{3}{5} \times \frac{2}{4} = \frac{3}{10}$$
Solution(b):
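$$P(\text{Different colours}) = P(Y_1 \text{ and } R_2) + P(R_1 \text{ and } Y_2) = \left(\frac{3}{5} \times \frac{2}{4}\right) + \left(\frac{2}{5} \times \frac{3}{4}\right) = \frac{3}{5}$$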
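The answers to Example 5.9 can be verified by listing every ordered pair of draws. The Python sketch below is an illustration only; it treats the five balls as distinct objects so that each of the 20 ordered draws is equally likely.

```python
from fractions import Fraction
from itertools import permutations

balls = ["Y", "Y", "Y", "R", "R"]

# permutations() works by position, so identical colours are still counted
# as separate balls: 5 x 4 = 20 equally likely ordered draws.
draws = list(permutations(balls, 2))

p_both_yellow = Fraction(sum(first == second == "Y" for first, second in draws), len(draws))
p_different = Fraction(sum(first != second for first, second in draws), len(draws))

# Multiplication rule for dependent events, P(Y1) x P(Y2|Y1).
p_both_yellow_rule = Fraction(3, 5) * Fraction(2, 4)

print(p_both_yellow, p_different, p_both_yellow_rule)
```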
6. Contingency Table
A contingency table is a table used to classify sample observations according to two categorical
variables.
Example 5.10
The President of a university has proposed that all students take a course in ethics as a
requirement for graduation. Faculty members and students from this university were asked
about their opinion on this issue. The results are presented in the table below:
(a) Given that a randomly selected person is a faculty member, what is the probability that
the person agrees with the proposal?
Solution(a):
This is conditional probability and can be calculated without using the formula.
Note that there are 400 faculty members out of which 110 agree with the proposal.
Hence, $P(A|F) = \frac{110}{400} = 0.275$
(b) Find the probability that a randomly selected person is a faculty member and agrees
with the proposal.
Solution(b):
This is a joint probability which refers to the intersection of two events. Without the use of the
multiplication rule, we note that the total sample size is 1000 out of which the events “faculty”
and “agree” intersect at the value 110.
Hence, $P(F \text{ and } A) = \frac{110}{1000} = 0.11$
The alternative solutions for parts (a) and (b) using formulas are shown below:
Solution(a):
$$P(A|F) = \frac{P(A \text{ and } F)}{P(F)} = \frac{110/1000}{400/1000} = \frac{110}{400} = 0.275$$
Solution(b):
P(F and A) = P(F) X P(A|F)
$$= \frac{400}{1000} \times \frac{110}{400} = \frac{110}{1000} = 0.11$$
(c) Find the probability that a person selected at random from these 1000 persons is a
faculty member or agrees with the proposal.
Solution(c):
Using the addition rule for events that are not mutually exclusive, we have
P(F or A) = P(F) + P(A) – P(F and A)
$$= \frac{400}{1000} + \frac{280}{1000} - \frac{110}{1000} = \frac{570}{1000} = 0.57$$
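Parts (a) to (c) can be checked with exact fractions. The Python sketch below uses only the counts quoted in the worked solutions (1000 people, 400 faculty members, 280 who agree, 110 faculty members who agree); the variable names are illustrative.

```python
from fractions import Fraction

total = 1000
faculty = 400
agree = 280
faculty_and_agree = 110

p_f = Fraction(faculty, total)
p_a = Fraction(agree, total)
p_f_and_a = Fraction(faculty_and_agree, total)   # joint probability, part (b)

p_a_given_f = p_f_and_a / p_f                    # conditional probability, part (a)
p_f_or_a = p_f + p_a - p_f_and_a                 # general addition rule, part (c)

print(float(p_a_given_f), float(p_f_and_a), float(p_f_or_a))
```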
(d) Find the probability that a person selected at random from these 1000 persons is either
neutral about or agrees with the proposal.
Solution(d):
Using the addition rule for mutually exclusive events, we have
7. Bayes’ Theorem
Bayes’ Theorem is applied to revise probabilities of events after we obtain more information.
It is computed using the following formula:
$$P(A_1|B) = \frac{P(A_1)P(B|A_1)}{P(A_1)P(B|A_1) + P(A_2)P(B|A_2)}$$
Example 5.11
70% of working people like coffee and 30% like tea. Of those who drink coffee, 90% add
sugar, while 40% of those who drink tea add sugar.
              % of People    % who added sugar (S)
Coffee (C)    70%            90%
Tea (T)       30%            40%
A person added sugar to his drink. What is the probability that the drink is coffee?
Solution:
Given:
P(C) = 0.7    P(S|C) = 0.9
P(T) = 0.3    P(S|T) = 0.4
$$P(C|S) = \frac{P(C)P(S|C)}{P(C)P(S|C) + P(T)P(S|T)} = \frac{0.7 \times 0.9}{(0.7 \times 0.9) + (0.3 \times 0.4)} = 0.84$$
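The Bayes' theorem calculation in Example 5.11 can be sketched as a short Python function. `bayes_posterior` is a hypothetical helper name; it assumes the events in `priors` are mutually exclusive and collectively exhaustive.

```python
def bayes_posterior(priors, likelihoods, target):
    """Return P(target | evidence) from prior and conditional probabilities."""
    # Denominator: total probability of the evidence over all events.
    evidence = sum(priors[event] * likelihoods[event] for event in priors)
    return priors[target] * likelihoods[target] / evidence

priors = {"Coffee": 0.7, "Tea": 0.3}          # P(C), P(T)
likelihoods = {"Coffee": 0.9, "Tea": 0.4}     # P(S|C), P(S|T)

p_coffee_given_sugar = bayes_posterior(priors, likelihoods, "Coffee")
p_tea_given_sugar = bayes_posterior(priors, likelihoods, "Tea")

print(round(p_coffee_given_sugar, 2), round(p_tea_given_sugar, 2))
```

The two revised probabilities add up to 1, since a person who added sugar must have had either coffee or tea.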
A sample space, S, can be partitioned into 2 mutually exclusive events, A1 and A2.
The conditional probability formula (see section 5.2.2) is given by:
$$P(A_1|B) = \frac{P(A_1 \text{ and } B)}{P(B)}$$
8. Tree Diagrams
A tree diagram is useful for portraying conditional and joint probabilities. It is particularly
useful for analyzing business decisions involving several stages.
Example 5.12
Draw a tree diagram based on the information from Example 5.11
Solution:
[Tree diagram for Example 5.11; the joint probabilities of all final branches sum to 1.00]
- Circles are known as nodes and the lines are known as branches.
- The probabilities of each group of branches radiating from each node add up to 1.
- The joint probability is obtained by multiplying the simple probability with the
conditional probability.
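The points above can be illustrated by writing the tree for Example 5.11 as nested data. In this Python sketch the "No sugar" branches are the complements of the given sugar percentages; the structure and names are illustrative only.

```python
# Each first-stage branch holds its simple probability and its second-stage
# branches with their conditional probabilities.
tree = {
    "Coffee": (0.7, {"Sugar": 0.9, "No sugar": 0.1}),
    "Tea": (0.3, {"Sugar": 0.4, "No sugar": 0.6}),
}

# Joint probability = simple probability x conditional probability.
joint = {
    (drink, outcome): p_drink * p_outcome
    for drink, (p_drink, branches) in tree.items()
    for outcome, p_outcome in branches.items()
}

for path, probability in joint.items():
    print(path, round(probability, 2))

total_probability = sum(joint.values())
```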
9. Counting Rules
The multiplication formula indicates that if there are m ways of doing one thing and n ways of
doing another thing, then there are m X n ways of doing both.
Example 5.13
John takes a sandwich every morning. He can choose white bread or wholemeal bread together
with tuna, chicken or scrambled egg. How many possible ways are there for him to make his
sandwich?
Solution:
2 X 3 = 6 ways
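The multiplication formula can be confirmed by listing the choices in Example 5.13. The Python sketch below is an illustration only.

```python
from itertools import product

breads = ["white", "wholemeal"]
fillings = ["tuna", "chicken", "scrambled egg"]

# Every bread can be paired with every filling: m x n combinations.
sandwiches = list(product(breads, fillings))

for bread, filling in sandwiches:
    print(bread, "bread with", filling)

number_of_ways = len(sandwiches)
```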
9.2 Permutation
A permutation is any arrangement of r objects selected from n possible objects. The order of
arrangement is important in permutations.
$$_nP_r = \frac{n!}{(n-r)!}$$
where n = total number of objects
r = number of objects selected
Example 5.14
Find the number of ways to arrange 5 paintings on an exhibition wall. The 5 paintings are
chosen from a set of 7. The order of arrangement of these paintings is important.
Solution:
$$_7P_5 = \frac{7!}{(7-5)!} = 2{,}520$$
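The permutation in Example 5.14 can be checked in Python, both from the factorial formula and with the standard library function `math.perm` (available from Python 3.8). This is a sketch for verification only.

```python
from math import factorial, perm

n = 7   # total number of paintings
r = 5   # number of paintings selected and arranged

# Formula: n! / (n - r)!
by_formula = factorial(n) // factorial(n - r)

# Standard library shortcut for the same quantity.
by_library = perm(n, r)

print(by_formula, by_library)
```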
9.3 Combination
A combination is the number of ways to choose r objects from n objects without regard to
order.
$$_nC_r = \frac{n!}{r!(n-r)!}$$
Example 5.15
How many ways are there to choose 3 desserts from a menu of 10 desserts?
Solution:
$$_{10}C_3 = \frac{10!}{3!(10-3)!} = 120$$
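The combination in Example 5.15 can be checked with `math.comb` (available from Python 3.8). The sketch below also shows how combinations relate to permutations: each group of r objects can be ordered in r! ways, so dividing the number of permutations by r! removes the effect of order.

```python
from math import comb, factorial, perm

n = 10  # desserts on the menu
r = 3   # desserts chosen

by_library = comb(n, r)

# Each unordered group of r desserts corresponds to r! ordered arrangements.
from_permutations = perm(n, r) // factorial(r)

print(by_library, from_permutations)
```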
10.1 Characteristics of a Discrete Probability Distribution
Solution(a):
Number of houses painted (x) P(x)
10 5/20 = 0.25
11 6/20 = 0.30
12 7/20 = 0.35
13 2/20 = 0.10
Total 1.00
Solution(b):
$$\mu = \sum [x \cdot P(x)] = 10(0.25) + 11(0.30) + 12(0.35) + 13(0.10) = 11.3$$
$$\sigma^2 = \sum [(x - \mu)^2 \cdot P(x)] = (0.25)(10 - 11.3)^2 + (0.3)(11 - 11.3)^2 + (0.35)(12 - 11.3)^2 + (0.1)(13 - 11.3)^2 = 0.91$$
(c) Find the probability of x > 10.
Solution(c):
P(x>10) = P(x=11) + P(x=12) + P(x=13) = 0.30 + 0.35 + 0.10 = 0.75
Solution(d):
P(x<10) = 0 (from the probability distribution, there are no values below 10)
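The mean, variance and standard deviation of this discrete probability distribution can be verified with a short Python sketch. It uses the probability table from Solution(a); the variable names are illustrative.

```python
from math import sqrt

# Number of houses painted (x) and its probability P(x).
distribution = {10: 0.25, 11: 0.30, 12: 0.35, 13: 0.10}

# The probabilities of a discrete distribution must add up to 1.
total_probability = sum(distribution.values())

mean = sum(x * p for x, p in distribution.items())
variance = sum((x - mean) ** 2 * p for x, p in distribution.items())
std_dev = sqrt(variance)

p_more_than_10 = sum(p for x, p in distribution.items() if x > 10)
p_less_than_10 = sum(p for x, p in distribution.items() if x < 10)

print(round(mean, 2), round(variance, 2), round(std_dev, 3))
print(round(p_more_than_10, 2), p_less_than_10)
```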
11. Discussion questions
3. An urn has 4 red and 5 blue marbles. 3 marbles are drawn without replacement. Find
the probability that
(a) All marbles are blue.
(b) The first is red and the other two are blue.
4. A car dealer advertised that for $99,999 you can buy a Model X, Y or Z car with a
choice of either leather seats or fabric seats. How many different arrangements of
models and seat types can the dealer offer?
5. There are 12 players in a basketball team. The Coach wants to pick five players
among the twelve.
(a) How many different groups are possible?
(b) Suppose that in addition to selecting the group, he must also rank each of the
players in that starting lineup according to their ability. How many groups are
possible?
6. The soccer team of a junior college plays 70 percent of their games at night and 30
percent during the day. The team wins 50 percent of their night games and 90 percent
of their day games. According to today’s newspapers, they won yesterday. What is the
probability the game was played at night?
7. BBS Clifford Branch has 800 customers. 240 of these customers have housing loans.
Of these 240 customers, 120 customers also own a credit card issued by the bank.
Altogether, 500 customers own a credit card issued by the bank.
(b) Find the probability that a selected customer owns a credit card issued by the
bank.
(c) Given that a selected customer owns a credit card issued by the bank, what is
the probability that the customer also has a housing loan with the bank?
(d) Find the probability that a selected customer does not have a housing loan and
does not own a credit card issued by the bank.
8. ABC bank has three key managers who help manage customers’ investment portfolios.
These managers try to make profits for customers; sometimes losses are incurred. The
data from a sample of 300 customers show the following:
Outcome
Incurred Losses Made Profits TOTAL
Manager A 45 90 135
Manager B 15 110 125
Manager C 10 30 40
TOTAL 70 230 300
(a) Find the probability that a customer selected at random made profits.
(b) Find the probability that a customer selected at random belongs to Manager A
or has incurred losses.
(c) Find the probability that a customer selected at random belongs to Manager C
and has made profits.
(d) Given that the customer selected belongs to Manager C, what is the probability
that the customer made profits?
(e) Given that the customer incurred losses, what is the probability that he/she
belongs to Manager A?
9. The number of cars sold by a salesman on a typical day is given by the following
distribution:
No of cars sold Probability
0 0.5
1 0.3
2 0.2
12. Supplementary questions
1. Two fair dice are thrown. What is the probability of at least one odd number? What
is the probability of this if four dice are thrown?
2. A university graduate club wants to attract more fresh graduates to join the club. A
survey was conducted amongst fresh graduates recently. 67% indicated that gaming
facilities were influential in their decision to join, 44% felt that dining facilities were
influential, and 21% felt that both were influential factors. Using the results of the
survey as the probabilities for a potential new member, what is the probability that the
potential new member would consider neither gaming nor dining facilities as influential
factors when making a decision to join the club?
3. An insurance company reported the following experience with damage-only claim for
automobile accidents
(a) What is the probability that a person selected at random from this group of
insured made a claim?
(b) What is the probability that a person selected at random is “26 or older” or made
no claim?
(c) What is the probability that a person made a claim given that this person is under
age of 26?
4. Suppose you are eating at Burger King with two friends. You have agreed to the
following rules on who should pay the bill. Each person will toss a coin. The person
who gets a result that is different from the other two will pay the bill. If all three tosses
yield the same result, the bill will be shared by all.
5. Given the following contingency table summarized by a company with 3 suppliers:
Delivery Outcome
Early On Time Late TOTAL
Jones 20 20 10 50
Smith 10 90 50 150
Robinson 0 10 90 100
TOTAL 30 120 150 300
6. Two production lines contribute to the total amount of a company’s products. Line A
provides 30% of the total products, and 15% of Line A’s products are defective. Line B
provides the remaining, and 5% of Line B’s products are defective.
Suppose a defective item was randomly selected from the total products. What is the
probability that this item was produced by Line A?
7. A new sports car model has defective brakes 15 percent of the time and a defective
steering mechanism 5 percent of the time. Let’s assume (and hope) that these problems
occur independently. If one or the other of these problems is present, the car is called a
“lemon”. If both of these problems are present, the car is a “hazard”. Your instructor
purchased one of these cars yesterday.
(a) What is the probability the first two shavers will be returned to the store
because they are defective?
(b) What is the probability that the first two shavers will not be defective?
9. Hassan and Wendy appear for an interview for two vacancies in the same post. The
probability of Hassan’s selection is 1/7 and that of Wendy’s selection is 1/5. Find the
probability that
(a) Only one of them will be selected.
(b) Both of them will be selected.
(c) None of them will be selected.
(d) At least one of them will be selected.
10. The table below presents a sample of employees surveyed regarding their loyalty to
XYZ Pte Ltd
                        Length of Service
Loyalty                 <1 Year,  1-3 Years,  3-6 Years,  6-9 Years,  >9 Years,  Total
                        B1        B2          B3          B4          B5
Would remain, A1        18        11          5           28          12         74
Would not remain, A2    7         10          6           13          10         46
Total                   25        21          11          41          22         120
(a) State the probability of selecting an employee with more than 9 years of service.
(b) State the probability of selecting an employee who would not remain with the
company, given that he or she has more than 9 years of service.
(c) State the probability of selecting an employee with more than 9 years of service
or one who would not remain with the company.
11. DBS Bank reports that 66 percent of its customers maintain a checking account, 83
percent of its customers have a savings account, and 53 percent have both. Assuming
that a customer is being chosen randomly, state the probability of selecting a customer
who has either a checking account or a savings account. Also, state the probability of
selecting a customer who has neither a checking account nor a savings account.
12. Daniel takes 5 types of health supplements each morning. He changes the sequence of
consumption each day. How many possible sequences are there for Daniel to consume
these health supplements?
13. You got free tickets to watch the National Day Parade and you can bring along three
friends. However, you have five friends who want to come along. How many groups
of different friends can you take with you?
14. In how many ways can a panel of judges award the 1st, 2nd and 3rd prize among 12 senior
citizens participating in a singing competition?
15. A newly opened cafe provides value set meals at $5 only. Customers can select a main
course (chicken or fish), one type of vegetable (broccoli, spinach or cauliflower), one
side dish (soup, corn or scrambled egg) and one drink (coffee or tea). How many
different meal arrangements are possible?
16. A random variable takes the value 0, 1 and 4 according to the following distribution:
X P(X)
0 0.2
1 0.4
4 0.4
Total 1.00
BUSINESS STATISTICS
SESSION 6
This is a continuation of Session 4 where we have learnt how to use Excel to perform various
data/statistical analysis.
EXCEL LAB EXERCISE 3
1. Health care is an important issue to many people including the government. Researchers
recently conducted a survey of citizens over 60 years of age whose net worth was too
high to qualify for government medical insurance and who have no private health
insurance. The ages of 25 uninsured senior citizens were as follows:
60 61 62 63 64 65 66 68 68 69 70 73 73
74 75 76 76 81 81 82 86 87 89 90 92
(a) Find the mean and the sample standard deviation of the ages of the uninsured
senior citizens.
(b) Set up a frequency distribution (including relative frequency) as shown below:
75 76 83 91 80 77 84 81 80 73
(a) Find the mean.
(b) Find the median and mode.
(c) Determine the standard deviation.
3. The following shows the net profits($) of 12 branches of Everfresh Florist Shop on
Mother’s Day.
903 1745 3883 863 1204 1624
1698 957 1041 1138 1354 1802
Determine the:
4. Western Digital Media has engaged an independent market consultant to conduct a
survey on whether there is any relationship between the sales of its products and the
advertising expenditures. The data are collected as follows:
5. The local ice cream shop keeps track of how much ice cream they sell versus the noon
temperature on that day. Here are their figures for the last 12 days:
6. The blood groups of 20 patients are listed below:
EXCEL LAB EXERCISE 3 (ANSWERS)
1(a) Mean = 74.04
Sample standard deviation = 9.7446
(b)
Age (years) Frequency Relative Frequency
60 up to 65 5 20%
65 up to 70 5 20%
70 up to 75 4 16%
75 up to 80 3 12%
80 up to 85 3 12%
85 up to 90 3 12%
90 up to 95 2 8%
Total 25 100%
(c)
[Histogram: Frequency against Age (years)]
2(a) 80
(b) 80, 80
(c) 5.228
3(a) 1517.67
(b) 1279
(c) 819.558
[Scatter plot; horizontal axis: Advertising ($000)]
5(a) Independent (X) : Temperature
Dependent (Y) : Sales ($)
(b)
[Scatter plot: Temperature vs Sales; horizontal axis: Temperature (degree celsius); vertical axis: Sales ($000)]
6(a)
Row Labels     Count of Blood Group
A              3
AB             4
B              5
O              8
Grand Total    20
(b)
[Pie chart of blood groups: A 15%, AB 20%, B 25%, O 40%]
(c) 20%
(d) 40%
(e) 40%
BUSINESS STATISTICS
SESSION 7
_________________________________________________________________
1. Introduction
Research studies involving data rarely focus on a single factor (variable). Most of the time, we
wish to know the relationships or associations between two or more variables. In this topic, our
attention will be focused on the relations between two quantitative variables. In such cases, two
variables will be recorded for each sampling unit at a particular point in time. These paired data
are called ‘bivariate data’.
2. Scatter Diagram
A scatter diagram or scatter plot is helpful in detecting a relationship between two variables. It
shows the paired values of two variables and is constructed by using the horizontal axis as the
independent variable (X) and the vertical axis as the dependent variable (Y).
Ideally, once a scatter diagram is drawn, we try to “best fit” a line through it in such a way that
it indicates to us the type of relationship between the two variables.
Example 7.1
The following table (Table 7.1) shows the number of sales calls made by bank relationship
managers and the number of investment products that they have sold over the past year. Draw
a scatter diagram.
Manager    Number of sales calls    Number of products sold
Ali        150                      70
Joe        100                      40
Maria      50                       60
Lina       50                       30
Sue        150                      40
Herbert    100                      50
Bernard    40                       30
Donald     200                      70
Table 7.1 Sales Data
Solution:
We are examining whether the number of investment products sold (dependent variable) is
affected by the number of sales calls made (independent variable). We draw the dependent
variable (y) on the vertical axis and the independent variable (x) on the horizontal axis.
Thereafter, plot the points for each pair of x and y values.
Figure 7.1 No of Sales Calls vs No of Products Sold (horizontal axis: No of Sales Calls; vertical axis: No of Products Sold)
The diagram indicates a direct relationship between number of sales calls made and the
number of products sold.
Once a scatter diagram is drawn, we look for trends. Is it possible to have a line that best fits
most of the points? Let’s identify the various types of relationships between the variables.
Direct relationship: As x increases, y also increases. For example, more advertising expense leads to higher
sales.
Inverse relationship: As x increases, y decreases and vice versa. For example, when a car gets
older (i.e. age of car increases), the selling price declines.
No relationship: This occurs when we are unable to identify any obvious trend in the relationship between the
two variables.
Figure 7.2 shows the various relationships between variables. Our focus will be on “linear”
relationships between two variables.
3. Regression Analysis
In regression analysis, we try to find a line that best fits the points in the scatter diagram. The
least squares method provides us such a line.
3.1 Least Squares Regression Equation
The formulas used to calculate the intercept a and the slope b of the regression equation Ŷ = a + bX,
based on the least squares regression principle, are:
$$b = \frac{n\sum XY - \sum X \sum Y}{n\sum X^2 - (\sum X)^2}$$
$$a = \bar{Y} - b\bar{X}$$
where
X = values of the independent variable
Y = values of the dependent variable
n = no of pairs of data values
Example 7.2
AA Research trains research assistants to conduct surveys through face-to-face interviews.
Recently, the firm has undertaken a research project. The figures below show the number of
weeks that these research assistants have worked for the firm and the number of face-to-face
interviews conducted by each research assistant on a given day.
Research Assistant    Experience (weeks)    No of interviews conducted
1 15 4
2 41 9
3 58 12
4 18 6
5 37 8
6 52 10
7 28 6
8 24 5
9 45 10
10 33 7
(a) Find the least squares regression equation using weeks of experience as the
independent variable and number of interviews conducted as the dependent variable.
Solution:
X      Y     X²      Y²     XY
15     4     225     16     60
41     9     1681    81     369
58     12    3364    144    696
18     6     324     36     108
37     8     1369    64     296
52     10    2704    100    520
28     6     784     36     168
24     5     576     25     120
45     10    2025    100    450
33     7     1089    49     231
ΣX = 351    ΣY = 77    ΣX² = 14141    ΣY² = 651    ΣXY = 3018
$$b = \frac{n(\sum XY) - (\sum X)(\sum Y)}{n(\sum X^2) - (\sum X)^2} = \frac{10(3018) - (351)(77)}{10(14141) - (351)^2} = 0.173$$
$$a = \bar{Y} - b\bar{X} = \frac{77}{10} - 0.173\left(\frac{351}{10}\right) = 1.628$$
Regression equation: $\hat{Y} = 1.628 + 0.173X$
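The least squares calculation for Example 7.2 can be reproduced with a short Python sketch. `least_squares` is a hypothetical helper name that applies the two formulas directly. Note that it keeps b unrounded when computing a, so it gives a ≈ 1.622, whereas the hand calculation uses b rounded to 0.173 and obtains 1.628.

```python
weeks = [15, 41, 58, 18, 37, 52, 28, 24, 45, 33]        # X, experience
interviews = [4, 9, 12, 6, 8, 10, 6, 5, 10, 7]          # Y, interviews conducted

def least_squares(x, y):
    """Return (a, b) for the line Y-hat = a + bX using the summation formulas."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    sum_x2 = sum(xi * xi for xi in x)
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    a = sum_y / n - b * sum_x / n
    return a, b

a, b = least_squares(weeks, interviews)
print(round(a, 3), round(b, 3))
```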
Solution:
Interpretation of ‘a’
The Y-intercept (i.e. the value of ‘a’) represents the value of Y when X equals zero. With no
work experience, we expect the number of interviews conducted to be about 1.628.
We should, however, be very careful while making this interpretation of a. In our sample of
ten research assistants, the weeks of experience varies from 15 to 58. Since x=0 is outside this
range of experience, the prediction usually will not hold true. This will be explained further
in section 3.2.
Interpretation of ‘b’
The value of b is 0.173. ‘b’ refers to the gradient or slope of the regression line. It gives the
change in y (dependent variable) due to a change of one unit in x (independent variable). In
this example, we can state that on average, a one week increase in work experience will
increase the number of interviews conducted by 0.173 units.
3.2 Prediction Using Linear Regression
A regression equation allows us to predict the y value for any given x value.
Interpolation
This is to use the regression line, Ŷ = a + bX, to find the estimated value of y for any given value
of x that lies within the data set.
Extrapolation
This is to use the regression line, Ŷ = a + bX, to find the predicted value of y for a given value of x
that lies beyond the range of the data set. Extrapolation must be used with caution. This is explained
in Section 5.
Example 7.3
From the regression equation obtained from Example 7.2, predict the number of interviews
conducted by a research assistant with 20 weeks of work experience.
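Solution:
Ŷ = 1.628 + 0.173(20) = 5.088
A research assistant with 20 weeks of work experience is predicted to conduct about 5 interviews.
Since x = 20 lies within the range of the data (15 to 58 weeks), this is an interpolation.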
4. Correlation Analysis
The correlation coefficient, r, is a number that indicates the direction and strength of the linear
relationship between an independent variable (X) and a dependent variable (Y). In other words, r measures
how closely the points in a scatter diagram are spread around the regression line.
$$r = \frac{n\sum XY - \sum X \sum Y}{\sqrt{[n\sum X^2 - (\sum X)^2][n\sum Y^2 - (\sum Y)^2]}}$$
[Figures: strong, positive linear correlation (r is close to 1); weak, positive linear correlation (r is positive and close to about 0.5)]
Example 7.4
Based on the data from example 7.2, compute the correlation coefficient.
Solution:
$$r = \frac{n\sum XY - \sum X \sum Y}{\sqrt{[n\sum X^2 - (\sum X)^2][n\sum Y^2 - (\sum Y)^2]}} = \frac{10(3018) - (351)(77)}{\sqrt{[10(14141) - (351)^2][10(651) - (77)^2]}} = 0.969$$
Interpretation of ‘r’
The correlation coefficient is 0.969. Since the value is close to 1, it indicates a strong,
positive correlation.
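The correlation coefficient for the Example 7.2 data can be checked with a short Python sketch. `correlation` is a hypothetical helper name that applies the summation formula directly.

```python
from math import sqrt

weeks = [15, 41, 58, 18, 37, 52, 28, 24, 45, 33]        # X
interviews = [4, 9, 12, 6, 8, 10, 6, 5, 10, 7]          # Y

def correlation(x, y):
    """Return the correlation coefficient r from the summation formula."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    sum_x2 = sum(xi * xi for xi in x)
    sum_y2 = sum(yi * yi for yi in y)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator

r = correlation(weeks, interviews)
print(round(r, 3))
```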
How good is a regression model? In other words, “How well does the independent variable
explain the dependent variable in the regression model”. The coefficient of determination
answers this question.
The coefficient of determination, denoted by r², represents the proportion of variation in the
dependent variable (y) that is explained by the variation in the independent variable (x). The
value of r² ranges from 0 to 1.
Example 7.5
Further to Example 7.4, compute the coefficient of determination and interpret the value.
Solution:
r² = (0.969)² = 0.939
Interpretation of ‘r²’
About 93.9% of the variation in the number of interviews conducted (the dependent variable)
is explained by the variation in weeks of experience (the independent variable).
5. Cautionary Notes
We should apply linear regression with caution. Here are some points to watch out for:
Non-linear relationships
We have only seen how to use a straight line to model the best fit. Sometimes, the relationship
may not be linear. Hence, it is good practice to construct a scatter diagram and examine the plot
before applying simple linear regression.
Extrapolation
A linear regression equation is established from the set of data collected. If you use the
estimated regression line to predict with x values that lie outside the range of the original
data, the estimates may be inaccurate. This is known as extrapolation.
For example, the values of x in our example on weeks of experience and number of interviews
conducted range from a minimum of 15 to a maximum of 58 weeks. Hence, our estimated
regression line is only applicable for values of x within this range. If we predict y for a
value of x either less than 15 or greater than 58, it is called extrapolation. We should interpret
such predictions cautiously and not attach much weight to them.
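One way to make this check routine is to flag any prediction whose x value falls outside the observed range. The sketch below assumes the 15 to 58 week range quoted above; the coefficients `A` and `B` are hypothetical placeholders, not the coefficients from Example 7.2, and the function name `predict` is illustrative.

```python
def predict(x, a, b, x_min, x_max):
    """Return (y_hat, is_extrapolation) for the line y_hat = a + b*x."""
    y_hat = a + b * x
    is_extrapolation = not (x_min <= x <= x_max)
    return y_hat, is_extrapolation

# Hypothetical coefficients for illustration only (NOT from Example 7.2)
A, B = 1.0, 0.2
X_MIN, X_MAX = 15, 58  # observed range of weeks of experience

for weeks in (20, 70):
    y_hat, flagged = predict(weeks, A, B, X_MIN, X_MAX)
    label = "extrapolation, interpret with caution" if flagged else "interpolation"
    print(weeks, round(y_hat, 1), label)
```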
For example, data in a particular country showed an increase in car sales as well as an increase
in sales of new homes. One does not cause the other; the likely cause is higher incomes, a third
variable that is not included in the study.
Units of measurement
Be careful about the units of measurement used to obtain the regression equation, e.g. thousands
(000s) or millions. For example, if X is the advertising expenditure and the original x data
are recorded in 000s, then a value of $10,000 means x = 10 and not 10,000.
6. Performing Regression/Correlation Analysis using Excel
Input the following data (Figure 7.5) showing the credit card limit ($000) and average monthly
spending ($) of customers.
Figure 7.5
Figure 7.6
Select the input range for the dependent variable (Y) and independent variable (X):
Figure 7.7
The following Excel output (Figure 7.8) will be generated:
Y-intercept (a)
Slope or Gradient (b)
Figure 7.8 Regression Excel Output
• Correlation coefficient (r) = 0.756 (note: this value is displayed without the direction sign).
It will be positive if the slope (b) is positive. If the slope (b) is negative, then r will be
negative. In this case, r is positive.
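The sign rule in the bullet above can be written as a one-line helper. This is a minimal sketch: `signed_r` is an illustrative name, 0.756 is the unsigned value reported in the output, and the slope values passed in are hypothetical.

```python
import math

def signed_r(multiple_r, slope):
    """Attach the sign of the slope b to the unsigned correlation value."""
    return math.copysign(abs(multiple_r), slope)

print(signed_r(0.756, slope=2.5))   # 0.756  (hypothetical positive slope)
print(signed_r(0.756, slope=-2.5))  # -0.756 (hypothetical negative slope)
```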
7. Discussion questions
1. Sales ($mil) and advertising ($mil) data were collected from XYZ Co for a sample
of 4 months.
(a) Using Advertising Expense as the independent variable and Sales as the dependent
variable, draw the scatter diagram. Does the diagram indicate a relationship between
the two variables?
(e) Interpret the values of the gradient “b” and the y-intercept “a”.
2. The manufacturer of the Home Exercise Machine is studying the relationship between the
number of months since a machine was purchased and its usage time in the last
week. A phone survey produced the following results:
Customer ID Months since purchase (X) Hours used last week (Y)
891 4 7
832 6 0
621 9 1
319 9 2
756 2 6
753 6 0
669 7 2
900 4 7
764 6 4
428 6 0
A regression analysis produced the following partial output:
Regression Statistics
Multiple R ?
R Square 0.4518
Adjusted R Square 0.3832
Standard Error 2.2656
Observations 10
(a) From the statistical output given or otherwise, determine the coefficient of
determination and interpret the result.
(b) Compute the coefficient of correlation (use 3 dec pl) and interpret the result.
(c) Write the regression equation and interpret the meaning of the slope.
(d) Use the regression equation to predict the number of hours used in the last week for
a customer who bought the machine 5 months ago.
(e) If a customer bought a machine 24 months ago, can we still use the regression
equation to predict the number of hours used in the last week by the customer? Why
or why not?
3. The following table and chart show a sample of 10 companies in the restaurant
industry with their respective number of employees and annual profits.
1320 11,880,000
721 6,489,000
667 5,336,000
902 9,020,000
753 7,530,000
1396 11,168,000
1219 12,190,000
727 6,543,000
675 5,400,000
609 4,872,000
(a) Describe the relationship between the two variables from the scatter plot.
(b) Write down the regression equation using the statistical output. (use 3 dec pl)
(d) Name two other quantitative variables that may have a relationship with the
annual profits of a firm.
(f) Explain how annual profit changes for every additional employee.
(g) If a firm has 800 employees what is the expected amount of annual profit?
(h) If the firm has 3000 employees, can we still use the regression equation to
predict the annual profit? Explain.
(i) What is the expected change in annual profits if a firm reduces the number of
employees by 200?
(j) A firm has annual profits of $8 million, how many employees would you expect
this firm to have? (Round answer to nearest whole number)
8. Supplementary questions
1. Food Shop employs several sales representatives who call retail grocery outlets for the
purpose of merchandising the company’s food products. Sarah, the sales director
wishes to determine the relationship between the number of calls a sales representative
makes to a given retail outlet and the amount of the company’s food products purchased
by the outlet. She selected 5 retail outlets at random and obtained the following
information:
Retail Grocery Outlet | Number of sales calls during one month | Monthly Sales to Outlet ($000)
A 7 73
B 6 68
C 5 60
D 3 45
E 4 54
(a) Determine the linear regression equation using the least squares method, with the number
of sales calls as the independent variable.
(c) Henry, a sales representative, makes 6 sales calls to an outlet (known as Outlet K) during
the month. Estimate the monthly sales to Outlet K.
2. The table below shows the data on incomes and food expenditures of seven households.
(f) Predict the food expenditure for a household with income of $3,100.
(g) Would you use the least squares line in part (d) to predict the food expenditures of a
household with income of $5,500? Justify your answer.
3. For a particular car brand, you wish to study the relationship between the age of a car
and its selling price. Listed below is a random sample of 12 used cars sold during the
last year.
Car Age (X) Selling Price in $000 (Y)
1 9 8.1
2 7 6.0
3 11 3.6
4 12 4.0
5 8 5.0
6 7 10.0
7 8 7.6
8 11 8.0
9 10 8.0
10 12 6.0
11 6 8.6
12 6 8.0
(a) If we want to estimate selling price on the basis of the age of the car, which variable is
the dependent variable and which is the independent variable?
4. The management of Hello Electronics wants to investigate the relationship between the
years of experience and the number of units of Product M assembled by its employees
working in the assembly department. The management took a sample of seven employees
from the assembly department and observed them for a week. The following table gives
data on the years of experience for these employees and the number of units of Product
M each of them assembled per day.
Experience 5 11 15 7 2 10 9
No of units assembled 14 21 20 18 13 16 18
Given: Σx = 59, Σy = 120, Σx² = 605, Σy² = 2110, Σxy = 1075
(a) Find the least squares regression equation with experience as the independent variable
and units assembled as the dependent variable.
(b) Give a brief interpretation of the values of “a” and “b” calculated in part (a).
(d) Estimate the number of units of Product M assembled per day by a worker with 25 years
of experience. Comment on this finding.
5. A researcher would like to investigate the relationship between flight hours on a nonstop
trip and the one-way airfare being charged for economy class in short-haul flights within
Asia on a non-peak weekday.
A regression analysis produced the following output:
Regression Statistics
Multiple R 0.817506136
R Square 0.668316283
Adjusted R Square 0.626855818
Standard Error 231.3501891
Observations 10
Coefficients Standard Error t Stat P-value
Intercept -20.4233154 162.136052 -0.12596406 0.90286852
Flight hours 131.400885 32.7283682 4.01489266 0.00386858
(b) Write the regression equation. (Express the coefficients in equation to 3 decimal places)
(c) Determine the correlation coefficient and interpret the result. (Express your answer to
3 decimal places)
(d) Determine the coefficient of determination and interpret the result. (Express your
answer to 3 decimal places)
(e) Using the regression equation, estimate the expected airfare for a flight that takes 5
hours. (Express your answer to 2 decimal places)
(f) What is the expected change in the airfare for a flight that takes 2 hours longer?
(Express your answer to 2 decimal places)
(g) Based on the information above, can we estimate the airfare for flight from Singapore
to London that takes at least 13 hours? Explain.
6. The Traffic Police issues demerit points to motorists who have committed traffic
violations on the road so as to identify high-risk motorists or habitual traffic offenders.
A study was carried out to investigate whether years of driving experience affects the
number of demerit points chalked up by motorists over the last two years.
(b) Set up a scatter diagram and label the axes clearly. Does the diagram indicate a
relationship between the 2 variables?
(c) Given that the linear regression equation is Ŷ = 12.124 − 0.302X, state and interpret
the value of the slope.
(d) Determine the coefficient of correlation and interpret the value. (Express your answer
to 3 decimal places).
(e) Kareem has 9 years of driving experience while his father has 29 years of driving
experience. How many more or less demerit points would you expect Kareem to have
over the last two years? (Express answer as a whole number)
(f) Mr Gan has chalked up 11 demerit points over the last two years. How many years of
driving experience would you expect Mr Gan to have? (Express answer as a whole
number)
BUSINESS STATISTICS
SESSION 8
___________________________________________________________________________
1. Introduction
The normal distribution is the most important and most widely used of all the probability
distributions. A large number of phenomena in the real world are normally distributed
or approximately normally distributed. Continuous variables such as height, weight, examination
scores and the lifespan of an electronic item usually follow an approximately normal
distribution.
The normal probability distribution, when plotted gives a bell-shaped curve such that
Normal distribution curves can have the same mean but different standard deviations (see
Figure 8.2) or different means but the same standard deviation (see Figure 8.3). A larger
standard deviation (σ) results in a wider and flatter normal curve, which indicates more
variability or dispersion among the data.
Figure 8.2 Normal Distributions with same mean but different standard deviations
Figure 8.3 Normal Distributions with different means but same standard deviations
Assume we have a random variable, X, which is normally distributed with mean µ and
standard deviation σ. If we want to find the probability of X lying between two values a and b,
we find the area under the curve between a and b, which gives us
P(a ≤ X ≤ b) (see Figure 8.4).
The calculation of the probability (or area) involves more complex calculus and probability
density functions. Here, we will see how to find this probability – that is, the area under normal
curve, using a statistical table called the standard normal table.
Since a normal curve is specified by the mean (µ) and the standard deviation (σ), there would
be a different normal curve for each possible pair of µ and σ. That is, we would need one
normal table for each normal curve to find the probabilities. To overcome the
problem of needing a limitless number of normal distribution tables, we can “standardize” the
normal curve by expressing the original values of a normal random variable in terms of the
‘number of standard deviations away from the mean’.
With this approach, we can use one standard normal table for all normal curves. We refer to
this ‘one’ normal curve as the standard normal distribution. The standard normal distribution
has a mean of 0 and a standard deviation of 1.
Figure 8.5 displays the standard normal distribution curve. The units for the standard normal
distribution curve are denoted by z. We can call these units z values or z scores.
Note that the values of z on the left side of the mean are negative. However, the probability or
the area under the curve is always positive. A point with a z-value of 2 means that the point is
two standard deviations to the right of the mean. Similarly, a point with a z-value of -2 means
that the point is two standard deviations to the left of the mean.
How do we obtain the z-value for varying values of x? We note that all intervals containing
the same number of standard deviations from the mean will contain the same proportion of the
total area under the curve for any normal random variable, X.
The z-value is computed as
$$ z = \frac{x - \mu}{\sigma} $$
where
z = number of standard deviations from x to µ
x = value of the random variable
µ = mean of the distribution of this random variable
σ = standard deviation of this distribution
Example 8.1
The monthly incomes of security officers follow the normal distribution with a mean of $1,500
and a standard deviation of $300.
(a) What is the z value of an officer who earns $1,200 per month?
(b) What is the z value of an officer who earns $1,900 per month?
Solution:
(a) For x = 1200, $z = \frac{x - \mu}{\sigma} = \frac{1200 - 1500}{300} = -1$
(b) For x = 1900, $z = \frac{x - \mu}{\sigma} = \frac{1900 - 1500}{300} = 1.33$
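As a quick check, the two z values can be reproduced in Python. The helper name `z_score` is an illustrative choice; the inputs are those of Example 8.1.

```python
def z_score(x, mu, sigma):
    """Number of standard deviations between x and the mean mu."""
    return (x - mu) / sigma

z_a = z_score(1200, mu=1500, sigma=300)
z_b = z_score(1900, mu=1500, sigma=300)
print(z_a)            # -1.0
print(round(z_b, 2))  # 1.33
```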
As noted earlier, a z-value shows the distance between a particular value of x and the µ in terms
of number of standard deviations from the mean. The table in Appendix 2 (The Standard
Normal Table) lists the areas or probabilities for this standardized distribution. Figure 8.6
shows a portion of these probabilities.
We shall now apply the standard normal distribution to find the area (probability) in a normal
distribution for varying values of x.
Example 8.2
The monthly incomes of security officers follow the normal distribution with a mean of $1,500
and a standard deviation of $300. What is the probability that an officer earns
(a) between $1,500 and $1,800 per month?
(b) more than $1,200 per month?
(c) less than $1,200 per month?
(d) between $1,800 and $2,000 per month?
Solution (a):
$$ P(1500 \le x \le 1800) = P\left(0 \le z \le \frac{1800 - 1500}{300}\right) $$
$$ = P(0 \le z \le 1.00) $$
$$ = 0.3413 $$
The probability associated with a z of 1.00 is available from Appendix 2. To locate the
probability, go down the left column to 1.0 and then move horizontally to the column heading
0.00 (see Figure 8.7)
Solution (b):
$$ P(x > 1200) = P\left(z > \frac{1200 - 1500}{300}\right) $$
$$ = P(z > -1.00) $$
$$ = 0.3413 + 0.5 $$
$$ = 0.8413 $$
Recall that half the area of a normal curve is above the mean. So, the probability of selecting
an officer earning above $1,200 is obtained by adding two areas, that is, 0.3413 + 0.5.
Solution (c):
$$ P(x < 1200) = P\left(z < \frac{1200 - 1500}{300}\right) $$
$$ = P(z < -1.00) $$
$$ = 0.5 - 0.3413 $$
$$ = 0.1587 $$
Since half the area under the curve is 0.5 and the area between $1,200 and $1,500 is 0.3413,
the probability of x being less than $1,200 is (0.5 – 0.3413).
Solution (d):
$$ P(1800 \le x \le 2000) = P\left(\frac{1800 - 1500}{300} \le z \le \frac{2000 - 1500}{300}\right) $$
$$ = P(1.00 \le z \le 1.67) $$
$$ = 0.4525 - 0.3413 $$
$$ = 0.1112 $$
The situation is again separated into two parts. The probability of salaries lying between $1,500
and $1,800 is 0.3413. The probability of salaries lying between $1,500 and $2,000 is 0.4525.
Thus, the probability of salaries lying between $1,800 and $2,000 is (0.4525 – 0.3413).
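The four answers to Example 8.2 can be cross-checked without the printed table. The sketch below assumes Python 3.8 or later, where the standard library provides `statistics.NormalDist`. The helper `z` is an illustrative name, and it rounds z to two decimal places to mimic a table lookup.

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal: mean 0, standard deviation 1
MU, SIGMA = 1500, 300

def z(x):
    """z value rounded to 2 decimal places, as when reading the printed table."""
    return round((x - MU) / SIGMA, 2)

p_a = Z.cdf(z(1800)) - Z.cdf(z(1500))  # P(1500 <= x <= 1800)
p_b = 1 - Z.cdf(z(1200))               # P(x > 1200)
p_c = Z.cdf(z(1200))                   # P(x < 1200)
p_d = Z.cdf(z(2000)) - Z.cdf(z(1800))  # P(1800 <= x <= 2000)

for part, p in [("a", p_a), ("b", p_b), ("c", p_c), ("d", p_d)]:
    print(part, round(p, 4))
```

Here `cdf` returns the whole area to the left of z, whereas the table in Appendix 2 gives the area between 0 and z, so the code subtracts cumulative areas instead of adding or subtracting 0.5.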
We will now do a reverse procedure where we find the corresponding value of z or x when an
area under a normal curve is known.
Example 8.3
Find the value of z such that the area under the curve between 0 and z is 0.3888.
Solution:
To obtain the z value, we have to locate 0.3888 in the body of the standard normal table. Then
we read the numbers in the z column and the header to obtain the z value of 1.22 (see Figure
8.8).
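The reverse lookup can also be sketched in Python, assuming `statistics.NormalDist` (Python 3.8 or later). Because `inv_cdf` works with the cumulative area to the left of z, 0.5 is added to the area between 0 and z first. The variable names are illustrative.

```python
from statistics import NormalDist

area_between_0_and_z = 0.3888
# inv_cdf expects the total area to the left of z, hence the added 0.5
z_value = NormalDist().inv_cdf(0.5 + area_between_0_and_z)
print(round(z_value, 2))  # 1.22
```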
Example 8.4
Applicants for a particular job are required to sit for an aptitude test. The test scores are
normally distributed with a mean of 40 and a standard deviation of 7. Hilary is going to sit for
this test soon. What should her score be so that only 15% of all who sit for this test score higher
than she does?
Solution:
Let x represent the test scores of all job applicants. We wish to find the value of x such that the
area under the curve to the right of x is 15%.
If Hilary scores 47.28 on the test, only about 15% of job applicants are expected to score higher
than she does.
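The working behind the score of 47.28 can be sketched as follows, assuming `statistics.NormalDist` (Python 3.8 or later). The z value is rounded to two decimal places to mirror a table lookup, and the variable names are illustrative.

```python
from statistics import NormalDist

mu, sigma = 40, 7
upper_tail = 0.15  # proportion of applicants who should score higher

# z value with 85% of the area to its left, rounded as in a table lookup
z_value = round(NormalDist().inv_cdf(1 - upper_tail), 2)
x = mu + z_value * sigma  # convert back from z to a test score
print(z_value, round(x, 2))  # 1.04 47.28
```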
4. The Sampling Distribution
For any population data set, there is only one value of the population mean, µ. However, when
we deal with samples, we would expect different samples of the same size drawn from the same
population to yield different values of the sample mean, 𝑥̅ . Like any other random variable,
the sample mean, 𝑥̅ , possesses a probability distribution, which is called the sampling
distribution of 𝑥̅ .
Hence, we have this definition: “The sampling distribution is the probability distribution of a
sample statistic, such as the sample mean.” It is obtained by drawing all possible samples
of a given size from the population and computing the sample statistic for each sample.
4.1 Mean and Standard Deviation of 𝑥̅
The mean and standard deviation of the sampling distribution of 𝑥̅ are denoted by
$\mu_{\bar{x}}$ and $\sigma_{\bar{x}}$ respectively.
The standard deviation of 𝑥̅ is known as the standard error ($\sigma_{\bar{x}}$).
If we take all possible samples (of the same size) from a population and calculate their means,
we will find that the mean of these sample means ($\mu_{\bar{x}}$) is always equal to the
population mean, µ. This can be demonstrated with the simple example below:
Example 8.5
The following data give the years of experience for all four employees of a small company.
The random variable, X = Years of experience.
Employee X
Mark 1
Frank 1
Dawn 3
Sue 5
We now list all the possible samples of size 2 (n=2) from this population.
We shall calculate the mean for this sampling distribution of means, that is, we obtain
an average of all the sample means.
$$ \mu_{\bar{x}} = \frac{1 + 2 + 3 + 2 + 3 + 4}{6} = 2.5 $$
which is exactly equal to µ. (proven)
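The enumeration in Example 8.5 can be reproduced in Python as a check. The sketch lists every sample of size 2 drawn without replacement from the four employees, then compares the average of the sample means with the population mean. The variable names are illustrative.

```python
from itertools import combinations
from statistics import mean

years = {"Mark": 1, "Frank": 1, "Dawn": 3, "Sue": 5}
population = list(years.values())

# All 6 possible samples of size 2 and their sample means
sample_means = [mean(pair) for pair in combinations(population, 2)]

mean_of_sample_means = mean(sample_means)
population_mean = mean(population)
print(sample_means)
print(mean_of_sample_means, population_mean)  # both equal 2.5
```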
The standard deviation of the sampling distribution is given by the formula below:
$$ \sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}} $$
We call this the Standard Error.
If the original population is normally distributed, the sampling distribution of sample means
will also be normal, whatever the value of the sample size (n).
However, if the original population is NOT normally distributed, then the sampling distribution
of sample means will be approximately normal only for large sample sizes (n ≥ 30). The
distribution approaches the normal distribution as the sample size n increases. This is known as
the Central Limit Theorem. (See Figure 8.9)
5. Computing Probabilities for a Sample Mean
When the sampling distribution is normally distributed, we are able to find the probability of
a sample mean taking on certain values within a specified range. We will need to find the
corresponding z values for values of 𝑥̅ in order to use Appendix 2 (Standard Normal Table).
$$ z = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}} $$
Example 8.6
(Large sample, non-normal population distribution)
The mean rent paid by all retail shops in a large city mall is $950 with a standard deviation of
$225. However, the population distribution of rents for all retail shops in this city mall is
skewed to the right. A sample of 100 shops was taken.
(a) Will the sampling distribution of 𝑥̅ be normal? Explain.
(b) Find the probability that the mean rent exceeds $990.
Solution (a):
Although the population distribution of rents paid by all retail shops is not normally
distributed, the sample size of 100 is large (n ≥ 30). Hence, the Central Limit Theorem (CLT)
can be applied to infer the shape of the sampling distribution of 𝑥̅ . By the CLT, the sampling
distribution will be approximately normal.
Solution (b):
$$ \mu_{\bar{x}} = \mu = 950 $$
$$ P(\bar{x} \ge 990) = P\left(z \ge \frac{990 - 950}{225 / \sqrt{100}}\right) $$
$$ = P(z \ge 1.78) $$
$$ = 0.5 - 0.4625 $$
$$ = 0.0375 $$
Example 8.7
(Small sample, normal population distribution)
Upper primary school children have allowances that are approximately normally distributed
about a mean of $39 per week and a standard deviation of $2. A random sample of 25 children
is taken and the mean is calculated. What is the probability that this mean value will be between
$38.50 and $40?
Solution:
Although the sample size is small (n<30), the shape of the sampling distribution of 𝑥̅ is
normal because the population is normally distributed.
$$ P(38.5 \le \bar{x} \le 40) = P\left(\frac{38.5 - 39.0}{2 / \sqrt{25}} \le z \le \frac{40.0 - 39.0}{2 / \sqrt{25}}\right) $$
$$ = P(-1.25 \le z \le 2.50) $$
$$ = 0.3944 + 0.4938 $$
$$ = 0.8882 $$
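Examples 8.6 and 8.7 follow the same pattern, so both can be checked with one helper. This is a sketch assuming `statistics.NormalDist` (Python 3.8 or later). The function name `z_for_mean` is illustrative, and z is rounded to two decimal places to mimic a table lookup.

```python
import math
from statistics import NormalDist

Z = NormalDist()  # standard normal distribution

def z_for_mean(x_bar, mu, sigma, n):
    """z value of a sample mean, using the standard error sigma / sqrt(n)."""
    return round((x_bar - mu) / (sigma / math.sqrt(n)), 2)

# Example 8.6: P(x_bar >= 990)
p_86 = 1 - Z.cdf(z_for_mean(990, mu=950, sigma=225, n=100))

# Example 8.7: P(38.5 <= x_bar <= 40)
p_87 = (Z.cdf(z_for_mean(40, mu=39, sigma=2, n=25))
        - Z.cdf(z_for_mean(38.5, mu=39, sigma=2, n=25)))

print(round(p_86, 4), round(p_87, 4))
```

The second result may differ from the table-based 0.8882 in the fourth decimal place, because the printed table rounds each area to four decimal places before they are added.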
6. Discussion Questions
2. Assume the distribution of monthly food expenditures for a family of four follows the
normal distribution, with a mean of $490 and a standard deviation of $90.
(a) What is the probability that a selected family spends less than $430 on food?
(b) What is the probability that a selected family spends between $500 and $600 on
food?
(c) It is known that 10% of families spent below $X. Find the value of X.
(d) Is it likely for a selected family to spend more than $800 on food? Justify.
4. Assume that the number of weekly study hours for students at a certain university is
approximately normally distributed with a mean of 22 and a standard deviation of 6.
(a) Find the probability that a randomly chosen student studies less than 12 hours.
(b) A certain lecture group consists of 225 students. You may assume that this
group forms a simple random sample from the students in the university. Find
the probability that the average number of study hours is between 21 and 23
hours.
5. The transport claims made by marketing managers of Alliance Global follow a normal
distribution with a mean of $490 per month and standard deviation $80.
(a) What is the probability that a randomly selected manager has a transport claim
of more than $600?
(b) Suppose a sample of 49 managers was selected. What is the probability that the
mean transport claim is greater than $470?
7. Supplementary Questions
1. The average return achieved by people who invested in Real Estate Investment Trusts
or REITS is normally distributed with mean 9 percent and standard deviation 1.2
percent. Find the probability that a randomly selected investor achieved a return of
(a) more than 10 percent
(b) between 8 to 9.5 percent.
2. The life of a Model J7 electric shaver has a normal distribution with mean 65 months
and standard deviation of 6 months. The company is providing a warranty period such
that it does not replace more than 1% of the shavers. What is the warranty period
(months)?
3. YCH Logistics pays its part-time employees an average wage of $6.40 an hour with a
variance of $0.64. If the wages are approximately normally distributed,
(a) What percentage of the employees receive wages between $5.50 to $6.60?
(b) The bottom 20% of employees receive wages less than $X an hour. What is
the value of X?
(c) What is the probability that a sample of 36 employees will have a mean wage
of less than $6.10 an hour?
5. Assume that the weights of all packages for a certain brand of chocolate bar are normally
distributed with mean of 32 grams and a standard deviation of 0.3 grams. Find the
probability that the mean weight of a random sample of 20 packages of this brand of
chocolate bar will be between 31.8 to 31.9 grams.
6. The time taken to learn a major sewing job for a new worker hired in the production
department of a garment factory is normally distributed with a mean of 80 hours and a
standard deviation of 6 hours. Find the probability that the mean time taken to learn
this job by a random sample of 16 new workers would be
(a) between 76 and 78 hours.
(b) within 4 hours of the population mean
(c) more than the population mean by at least 3.5 hours.
7. The monthly car parking charges incurred by car drivers in a town have a
skewed distribution with a mean of $65 and a standard deviation of $25. Find the
probability that the mean amount of car parking charges for a random sample of 75
drivers selected from this town will be
(a) more than $70.
(b) between $58 and $63.
(c) less than the population mean by at least $5.
8. A company that manufactures dispensing machines for hot beverages sets the fill
level at 197.5cc. The filling process gives a standard deviation of 5cc. The fill levels are
normally distributed.
(a) What is the probability that a randomly selected drink contains less than 190cc?
(b) What is the probability that a random sample of 50 drinks has a mean value
greater than 199cc?
(c) The company claims that an average drink is 200cc. What percentage of the
sample means are 200cc or more if samples of size 36 are taken?
9. The final scores of a management module follow the normal distribution. The mean of
the distribution is 74 and the standard deviation is 5. The professor wishes to award an
A grade only to the students whose scores belong to the highest 3%. Calculate the
dividing point for those students who earn an A grade and those who do not.
10. A machine is programmed to fill up bags of cement for industrial use. The amount filled
up per bag is normally distributed with mean 15 kg and a standard deviation of 1.5 kg.
(a) What percent of the bags contain between 15 to 16.4 kg?
(b) Bags that contain less than 13 kg are considered to be under-filled and will be
rejected for sale. What percentage of bags are under-filled?
BUSINESS STATISTICS
SESSION 9
ESTIMATION
___________________________________________________________________________
1. Introduction
Statistical inference is the process of using sample results to draw conclusions about the
characteristics of a population. In this chapter, we shall examine a statistical procedure that
will enable us to estimate the true population mean (µ) and population proportion (π). This
procedure is known as Estimation.
There are two types of estimates namely a point estimate and an interval estimate that can be
used to estimate the true population parameter.
If we select a sample and compute the value of the sample statistic for this sample, this sample
value gives the point estimate of the corresponding population parameter. For example, suppose
you take a sample of students from a university and find that the mean travelling time to the
university for this sample is 45 minutes. Then, using the sample mean (𝑥̅ ) as a point estimate
of the population mean (µ), we can say the mean travelling time for all students is about 45
minutes.
Hence, we have the following commonly used point estimators to estimate population
parameters.
2.2 Interval Estimate
When we do a point estimate, we can never be sure that 𝑥̅ , which is based on sample data is
equal to µ. A point estimate is insufficient for making reliable inferences about the population
parameter. Hence, we should construct an interval estimate which serves the purpose of making
inferences better.
In the case of an interval estimate, instead of using a single value, we use a range of values
within which the true value of the population parameter is likely to be included. For example,
instead of saying that the mean travelling time is 45 minutes, we could subtract and add 15
minutes to 45 minutes and then say that the mean travelling time ranges from 30 minutes to 60
minutes. This is known as an interval estimate.
The question arises, what number should we subtract from and add to a point estimate in order
to obtain the interval estimate? The width of this estimate depends on two considerations:
• The standard error ($\frac{\sigma}{\sqrt{n}}$) and
• The required level of confidence, e.g. 95%
We always attach a probabilistic statement to the interval estimate. This statement is given by
the confidence level, for example, 95%. An interval that is constructed based on this confidence
level is called a confidence interval. Although any value of confidence level can be chosen,
the commonly used ones are 90%, 95% and 99%. In general, the level of confidence is
symbolized by (1 − α), where α is the proportion in the tails of the distribution that lies
outside the confidence interval. See Figure 9.1.
If we select all possible samples of size n and calculate the 95% confidence interval for each
of these samples, then 95% of such intervals will contain the true population mean.
A sample size is considered large when n is 30 or larger. When the sample size is large, we
will use the z distribution to construct a confidence interval for µ. The confidence interval for
µ is
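$$ \bar{x} \pm z \frac{\sigma}{\sqrt{n}} $$
$\frac{\sigma}{\sqrt{n}}$ is known as the standard error of the mean.
$z \frac{\sigma}{\sqrt{n}}$ is known as the margin of error, E (i.e. $E = z \frac{\sigma}{\sqrt{n}}$).
Note: The formula can also be written as $\bar{x} \pm z_{\alpha/2} \frac{\sigma}{\sqrt{n}}$.
We will leave out the subscript $\alpha/2$ in this coursebook.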
In practice, we often do not know the value of the population standard deviation, 𝜎. Whenever
𝜎 is unknown, it can be estimated by the sample standard deviation, s. The confidence interval
for µ will then be
$$ \bar{x} \pm z \frac{s}{\sqrt{n}} $$
The steps to obtain the z value for any given confidence level (for example 95%) are:
(a) Divide 0.95 by 2 which gives 0.4750
(b) Refer to Appendix 2 to look for 0.4750 in the body of the table and then record the
corresponding z value. This value is 1.96 for 95% confidence. See Figure 9.2.
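The two steps above amount to an inverse lookup of the standard normal distribution. A minimal sketch, assuming `statistics.NormalDist` (Python 3.8 or later) and an illustrative helper name `z_for_confidence`:

```python
from statistics import NormalDist

def z_for_confidence(level):
    """z value that leaves (1 - level) / 2 of the area in each tail."""
    # level / 2 is the area between 0 and z; inv_cdf needs the area left of z
    return NormalDist().inv_cdf(0.5 + level / 2)

for level in (0.90, 0.95, 0.99):
    print(level, round(z_for_confidence(level), 3))
```

For 95% confidence this prints 1.96, matching the table lookup. The values for other levels may differ slightly from a printed table, depending on how the table rounds.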
Example 9.1
The standard deviation of the length of J8 stainless steel bolts produced by a machine is known
to be 4.5 mm. For a simple random sample of 36 bolts, the average length is 48.4 mm. What
is the 90% confidence interval for the mean length of bolts produced by the machine?
Solution:
n=36 𝑥̅ = 48.4 𝜎 = 4.5
Example 9.2
For a sample of 35 renovation jobs, Albert found that the mean number of days taken to
complete a renovation job for a new HDB flat is 42 days with a sample standard deviation of 8
days. Construct a 95% confidence interval for the mean number of days taken to complete a
renovation job.
Solution:
n=35 𝑥̅ = 42 𝑠=8
Therefore, the 95% confidence interval for the mean number of days taken to complete a
renovation job is from 39.35 to 44.65 days.
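The interval in Example 9.2 can be reproduced with a short helper. This is a sketch: `z_interval` and its parameter names are illustrative, and z = 1.96 is the table value for 95% confidence.

```python
import math

def z_interval(x_bar, sd, n, z):
    """Large-sample confidence interval: x_bar +/- z * sd / sqrt(n)."""
    margin = z * sd / math.sqrt(n)
    return x_bar - margin, x_bar + margin

# Example 9.2: n = 35, sample mean 42 days, s = 8 days, 95% confidence
low, high = z_interval(x_bar=42, sd=8, n=35, z=1.96)
print(round(low, 2), round(high, 2))  # 39.35 44.65
```

The same function applies to Example 9.1 by passing σ = 4.5, n = 36, a sample mean of 48.4 and the z value for 90% confidence read from Appendix 2.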
When the sample taken is small (n < 30) and the population standard deviation, σ, is unknown,
the normal distribution is replaced by the t distribution to construct confidence intervals about
µ. We use the sample standard deviation, s, as a point estimate of σ.
The confidence interval for µ is
$$ \bar{x} \pm t \frac{s}{\sqrt{n}} $$
The value of t is obtained from the t distribution table (Appendix 3) for n-1 degrees of freedom
and the given confidence level.
Example 9.3
Find the t-value for a sample size of 18 and a 95% confidence level.
Solution:
When n=18, df = n-1 =17.
When the confidence level is 95%, the area in the two tails combined (also known as α) is equal
to 5% or 0.05. Alternatively, if you look at each tail separately, the area is α/2 or 0.025
(one tail).
Example 9.4
An accounting firm would like to set up a guideline for the time required to complete a certain
type of audit operation. A sample of auditing times from 18 different junior auditors was
obtained with a mean time of 3.2 hours and a standard deviation of 1.6 hours. Determine a 95%
confidence interval for the average time required in completing such type of auditing.
(a) Explain why the t-distribution should be used.
(b) Find the t-value for a 95% confidence level.
(c) Construct a 95% confidence interval.
Solution (a):
There are 2 conditions for using the t-distribution: n <30 and 𝜎 unknown.
Solution (b):
n=18 𝑥̅ = 3.2 𝑠 = 1.6
df = n – 1 = 18 -1 =17
t-value = 2.110 (from Appendix 3)
Solution (c):
95% confidence interval:
$$ \bar{x} \pm t \frac{s}{\sqrt{n}} = 3.2 \pm 2.110 \frac{1.6}{\sqrt{18}} $$
$$ = 3.2 \pm 0.796 $$
$$ = (2.404;\ 3.996) \text{ hours} $$
Therefore, the 95% confidence interval for the mean time required in completing such type of
auditing is from 2.404 hours to 3.996 hours.
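The calculation can be verified in Python. Since the Python standard library has no t distribution, this sketch takes the t-value read from Appendix 3 (2.110) as an input. The function name `t_interval` is illustrative.

```python
import math

def t_interval(x_bar, s, n, t_value):
    """Small-sample confidence interval: x_bar +/- t * s / sqrt(n)."""
    margin = t_value * s / math.sqrt(n)
    return x_bar - margin, x_bar + margin

# Example 9.4: n = 18, sample mean 3.2 hours, s = 1.6 hours, t = 2.110 (df = 17)
low, high = t_interval(x_bar=3.2, s=1.6, n=18, t_value=2.110)
print(round(low, 3), round(high, 3))  # 2.404 3.996
```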
We may often want to estimate the population proportion or percentage. For example, a
company may want to know the proportion of defective items received in a shipment. A hotel
may want to find the percentage of hotel guests who are satisfied with the service of the hotel.
The population proportion is denoted by π and the sample proportion by p. The confidence
interval for the population proportion is
$p \pm z \sqrt{\dfrac{p(1 - p)}{n}}$
The z value is obtained from the standard normal table (Appendix 2) for a given confidence
level. This value is located in the same way that was done for large sample estimates of µ.
$\sqrt{\dfrac{p(1 - p)}{n}}$ is the estimated standard error for proportions.
Example 9.5
A food company found that in a sample of 100 purchase orders, 10 contain errors. Find the
95% confidence interval of the population proportion of purchase orders that contain errors.
Solution:
n = 100 sample proportion, $p = \dfrac{x}{n} = \dfrac{10}{100} = 0.1$
z = 1.96 (look up 0.4750 from the body of standard normal table, Appendix 2)
$p \pm z \sqrt{\dfrac{p(1 - p)}{n}} = 0.1 \pm 1.96 \sqrt{\dfrac{0.1(1 - 0.1)}{100}} = 0.1 \pm 0.059$
= (0.041; 0.159)
Therefore, the 95% confidence interval for the proportion of purchase orders with errors lies
between 4.1% and 15.9%.
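A short Python sketch of the interval for a proportion is given below. The function name proportion_ci is our own, and the z value is computed with NormalDist (it is approximately the 1.96 found in Appendix 2).

```python
from math import sqrt
from statistics import NormalDist


def proportion_ci(successes, n, confidence=0.95):
    """Large-sample confidence interval for a population proportion."""
    p = successes / n                        # sample proportion
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    standard_error = sqrt(p * (1 - p) / n)   # estimated standard error
    margin = z * standard_error
    return p - margin, p + margin


# Example 9.5: 10 of 100 purchase orders contain errors.
lower, upper = proportion_ci(successes=10, n=100)
print(f"95% CI: ({lower:.3f}; {upper:.3f})")
```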
The larger the population variability (measured by 𝜎 or 𝜎²), the wider the Confidence Interval.
A larger 𝜎 will increase the standard error leading to a wider or less precise interval.
Samples and not a census are almost always used in research because of limited resources. If
we know the confidence level and the width of the confidence interval that we want, then we
will be able to find the approximate sample size that will produce the required result.
The following formulas will help us determine the required sample size, n.
Sample size for the estimation of µ: $n = \left(\dfrac{z\sigma}{E}\right)^2$
Sample size for the estimation of π: $n = p(1 - p)\left(\dfrac{z}{E}\right)^2$
where E is the margin of error.
Note that the final answer for the sample size should be rounded UP to a whole number. This
is always the case when determining sample size to ensure that the conditions of confidence
level and margin of error are met.
Example 9.6
A researcher wishes to know the mean annual income of human resource managers in the
manufacturing industry. How large a sample is required if he wants to be 95% confident that
the estimate is within $5000 of the true population mean annual income? Assume the
population standard deviation is $30,000.
Solution:
95% confidence level → z = 1.96
Margin of error, E = 5000
𝜎 = 30,000
$n = \left(\dfrac{z\sigma}{E}\right)^2 = \left(\dfrac{1.96 \times 30000}{5000}\right)^2 = 138.3$
Rounding up, the required sample size is 139.
Example 9.7
Lim Electronics has just installed a new machine that makes a part that is used in car autolocks.
The company wants to estimate the proportion of these parts produced by the machine that are
defective. The manager wants this estimate to be within 0.03 of the population proportion for
a 99% confidence level. What is the minimum sample size required?
Solution:
99% confidence level → z = 2.575
Margin of error, E = 0.03
Since we have no prior information about the value of p, we shall use p = 0.5.
$n = p(1 - p)\left(\dfrac{z}{E}\right)^2 = (0.5)(0.5)\left(\dfrac{2.575}{0.03}\right)^2 = 1841.8$
Rounding up, the minimum sample size required is 1842.
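The two sample size formulas, together with the rule of always rounding up, can be sketched in Python as follows. The function names are our own, and the z values are the table values used in Examples 9.6 and 9.7.

```python
from math import ceil


def sample_size_for_mean(z, sigma, margin_of_error):
    """Sample size needed to estimate a mean to within the margin of error."""
    # ceil() rounds UP so that the confidence and precision conditions are met.
    return ceil((z * sigma / margin_of_error) ** 2)


def sample_size_for_proportion(z, margin_of_error, p=0.5):
    """Sample size needed to estimate a proportion; p = 0.5 if nothing is known."""
    return ceil(p * (1 - p) * (z / margin_of_error) ** 2)


# Example 9.6: z = 1.96, sigma = 30,000, E = 5,000
n_mean = sample_size_for_mean(z=1.96, sigma=30000, margin_of_error=5000)

# Example 9.7: z = 2.575, E = 0.03, no prior estimate of p
n_prop = sample_size_for_proportion(z=2.575, margin_of_error=0.03)

print(n_mean, n_prop)
```

Using p = 0.5 gives the largest possible value of p(1 - p), so the resulting sample size is on the safe side when nothing is known about the proportion.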
8. Discussion Questions
1. A recent survey of 50 executives who were laid off from their previous position revealed
it took a mean of 26 weeks for them to find another position. The standard deviation is
known to be 6.2 weeks.
(a) What is the point estimate of the population mean, µ?
(b) Construct a 95% confidence interval for the population mean.
(c) A manpower ministry personnel says that the mean duration taken to find a job
after being laid off is 20 weeks. Is this estimate reasonable?
2. An insurance company reported that it paid out many claims last year for car accidents.
In a sample of 64 claims made this year, the mean claim amount was found to be $7,300
with a sample standard deviation of $1,200.
(a) Construct the 98% confidence interval for the population mean claim amount
for car accidents.
(b) Suppose a mistake was made in the computation of the standard deviation. The
value of the standard deviation is supposed to be larger. How would this affect
the width of the interval?
3. Furniture Land surveyed 600 consumers and found that 414 were enthusiastic about a
new décor plan they plan to show in a major home exhibition. Construct the 99%
confidence interval for the population proportion.
4. The Health Promotion Board wants to estimate the mean yearly milk consumption. A
sample of 16 people reveals the mean yearly consumption to be 60 litres with a standard
deviation of 20 litres.
(a) Explain why we need to use the t distribution. What assumption do you need to
make?
(b) For a 90% confidence interval, what is the value of t?
(c) Develop the 90% confidence interval for the population mean.
5. Family Health, a publisher of health magazine wants to determine the mean insurance
premium paid by its subscribers. If the population standard deviation is $1,000, what
sample size is needed if the firm wants to be 99% confident of being correct to within
± $250?
6. A survey of 20 teachers found that the mean age of the teachers is 40.6 years old, with
a sample standard deviation of 9.5 years.
(a) Find the 99% confidence interval of the population mean.
(b) An additional 20 teachers were surveyed. The mean age remained at 40.6 years
and the sample standard deviation remained at 9.5 years. 30 out of the 40
teachers have more than 10 years of teaching experience.
(i) Find the 95% confidence interval for the population mean age. (Express
answers to 1 decimal place)
(ii) Find the 95% confidence interval for the proportion of teachers with
more than 10 years teaching experience. (Express answers to 3 decimal
places)
7. BBS Bank has been providing incentives for consumers to use mobile banking more
extensively. Susan collected some data from a random sample of 100 customers and
found that the average number of mobile banking transactions each month for these
customers was 8.8 transactions with a standard deviation of 2.8 transactions.
(a) Find the 95% confidence interval for the population mean number of mobile
banking transactions per month.
(b) BBS Bank has set a goal of achieving a mean number of 10 mobile banking
transactions per month. Based on the result in part (a), has the bank achieved
its target? Explain.
(c) What is the minimum sample size required if the bank wants to estimate the
mean number of mobile banking transactions to within 0.6 transactions with
98% confidence?
9. Supplementary Questions
2. A random sample of 150 people had a mean weight of 71.2kg with a standard deviation
of 4.9kg. Construct a 90% confidence interval for the mean weight of the population
from which this sample was taken.
Singaporeans have a mean weight of 68.7kg. Is it likely that this sample was
a sample of Singaporeans? Explain your answer.
3. A company is considering introducing a new scheme of shift work. They would like to
know whether the scheme is favourable to the majority of workers before they introduce
it. A random sample of 73 workers showed 43 in favour.
Construct a 95% confidence interval and advise the company how they should act.
4. A large company is looking at the time taken by workers to complete a particular job in
a plant. A sample of 41 workers showed a mean time of 34.3 minutes with a standard
deviation of 2.5 minutes. Give a 97% confidence interval for the mean time taken by
workers to complete the job.
5. The standard crop yield for a certain kind of vegetable averages 76 kg per square metre
of land plot. A new fertilizer is applied to a sample of 5 separate one-square metre plots.
The crop yields recorded are:
83, 81, 87, 79, 77
(a) Compute the sample mean and sample standard deviation.
(b) Construct a 95% confidence interval and comment on the results. Does the
fertilizer seem to be making a difference to crop yield? Assume that crop
yields are normally distributed.
7. When a sample of 70 retail managers were surveyed regarding the poor performance of
the retail industry in the recent quarter, 65% believed decreased sales were due to a
recent increase in goods and services tax (GST).
Find the 95% confidence interval for the proportion of retail managers who believed
that decreased sales were due to increase in GST.
8. The standard deviation for a population is 𝜎 =16.4. A sample of 100 observations
selected from this population gave a mean equal to 143.72.
(a) Construct a 90% confidence interval for µ.
(b) Construct a 95% confidence interval for µ.
(c) Construct a 99% confidence interval for µ.
(d) Does the width in parts (a) to (c) increase as confidence level increases?
Explain.
10. As an executive of the Consumer’s Association, you took a random sample of 10 cans
of baked beans at a canning plant. The net weights of the beans (in ounces) are reported
in the table below.
16.2 16.1 15.6 15.8 16.2
16.1 15.9 16.0 15.7 15.9
BUSINESS STATISTICS
SESSION 10
1. Introduction
In a test of hypothesis, we test a certain given belief about a population parameter. If someone
makes a claim about the general population value, how can we substantiate that?
Hypothesis testing allows us to evaluate the situation using sample information and then
conclude if the claim is true or has to be rejected. Obviously, a sample value is usually different
from the claim about the population value – the task is to judge if the “observed difference”
between a sample statistic and the hypothesized value of the population parameter is
statistically significant.
The null hypothesis (Ho) is a claim (or statement) about a population parameter that is assumed
to be true. The alternative hypothesis (H1) is the opposite of Ho. It is a claim that will be true
if the null hypothesis is false.
Assuming we are testing a claim about the population mean (µ), there are three possible choices
of formulating Ho and H1.
Left-tailed test: Ho : µ = µ0, H1 : µ < µ0 (e.g. Ho : µ = 10, H1 : µ < 10)
2.2 Significance Level
In doing hypothesis testing, a significance level (α) is set. α is the probability of
rejecting Ho when it is true. This is further explained under Part 4
(Types of Errors).
The test statistic is either a “z” or “t” value calculated based on sample information. It is a
quantity used in deciding whether or not to reject the Ho.
$z = \dfrac{\bar{x} - \mu}{\sigma / \sqrt{n}}$
$t = \dfrac{\bar{x} - \mu}{s / \sqrt{n}}$
Testing a proportion:
$z = \dfrac{p - \pi}{\sqrt{\dfrac{\pi(1 - \pi)}{n}}}$
The size of the rejection region depends on the value of the significance level (α). Although
any value can be assigned to α, the commonly used values of α are 0.01, 0.05 and 0.10. We
may have one or two rejection regions depending on whether it is a left-tailed, right-tailed or
two-tailed test (refer to Figure 10.1). The region outside the shaded region α is the
non-rejection region.
[Figure 10.1: Rejection regions for a left-tailed test, a right-tailed test and a two-tailed test]
The rejection region is also known as the critical region. For a two-tail test, the area α is
split into the two tails. The rejection region in each tail is α/2.
There are five basic steps to follow when performing a hypothesis test:
Step 4: Determine the critical value and form the decision rule.
We shall look at various situations to see how these steps are carried out.
Based on the Central Limit Theorem, the sampling distribution of 𝑥̅ is approximately normal
for large samples (n ≥ 30). Hence, whether 𝜎 is known or unknown, the normal distribution
is used to test the hypothesis about a population mean whenever we have a large sample.
Example 10.1
A polyclinic uses a certain drug with a mean packaged dose of 100 cm³. The standard deviation
is known to be 3 cm³. A random sample of 36 doses is selected and the mean dosage was found
to be 101 cm³. Test at the 0.01 significance level whether the mean dosage in the packages is
larger than 100 cm³.
Solution:
Step 1 Ho : µ ≤ 100
H1 : µ > 100 [This is a right-tailed test]
Step 2 α = 0.01 n = 36 σ = 3 𝑥̅ = 101 → Use Z since n is large.
Step 3 Test statistic, $z = \dfrac{\bar{x} - \mu}{\sigma / \sqrt{n}} = \dfrac{101 - 100}{3 / \sqrt{36}} = 2.0$
Since the test statistic falls outside the rejection region, we do not reject Ho. There is
insufficient evidence to conclude that the mean dosage is larger than 100 cm³ at the
α = 0.01 level of significance.
Example 10.2
A soft drink manufacturer claims that each soft drink can has a volume of 150 ml. A sample
of 40 soft drink cans was randomly selected for a quality control check and the sample mean
was found to be 130 ml with a standard deviation of 60 ml. Test at the 0.01 level of significance
whether the average volume of soft drink in the cans differs from 150 ml.
Solution:
Step 1 Ho : µ = 150
H1 : µ ≠ 150 [This is a two-tailed test]
Step 2 α = 0.01 n = 40 s = 60 𝑥̅ = 130 → Use Z since n is large.
Step 3 Test statistic, $z = \dfrac{\bar{x} - \mu}{s / \sqrt{n}} = \dfrac{130 - 150}{60 / \sqrt{40}} = -2.11$
Step 5
Since the test statistic falls outside the rejection region, we do not reject Ho. There
is insufficient evidence to conclude that the mean volume of soft drink in the
cans differs from 150 ml at the α = 0.01 level of significance.
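The large-sample tests in Examples 10.1 and 10.2 can be sketched in Python as shown below. The function name z_test_mean and its tail argument are our own, and the critical values are computed with NormalDist instead of being read from Appendix 2, so they may differ slightly from the rounded table values.

```python
from math import sqrt
from statistics import NormalDist


def z_test_mean(x_bar, mu_0, sd, n, alpha, tail):
    """Large-sample z test for a mean. tail is 'left', 'right' or 'two'."""
    z = (x_bar - mu_0) / (sd / sqrt(n))   # test statistic
    std_normal = NormalDist()
    if tail == "right":
        reject = z > std_normal.inv_cdf(1 - alpha)
    elif tail == "left":
        reject = z < std_normal.inv_cdf(alpha)
    elif tail == "two":
        # alpha is split equally between the two tails
        reject = abs(z) > std_normal.inv_cdf(1 - alpha / 2)
    else:
        raise ValueError("tail must be 'left', 'right' or 'two'")
    return z, reject


# Example 10.1: right-tailed test of the mean dosage
z1, reject1 = z_test_mean(101, 100, sd=3, n=36, alpha=0.01, tail="right")

# Example 10.2: two-tailed test of the mean volume
z2, reject2 = z_test_mean(130, 150, sd=60, n=40, alpha=0.01, tail="two")

print(round(z1, 2), reject1)
print(round(z2, 2), reject2)
```

In both examples the test statistic does not fall in the rejection region, which agrees with the decisions reached above.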
3.2 Hypothesis Test about a Population Mean : Small Sample and 𝝈 unknown
When a population is approximately normally distributed but the population standard deviation
𝜎 is unknown and the sample size is small (n<30), the normal distribution is replaced by the t
distribution to make a hypothesis test about µ.
The steps to conduct the test in the case of small samples are similar to those for large samples.
The only difference is the use of the t distribution in place of the normal z distribution.
Example 10.3
A new manager at JE Country Club has been told by his predecessor that the club’s members
have an average length of membership of 8.7 years. The manager took a random sample of 15
membership files, and he found the mean length of membership to be 7.2 years with a standard
deviation of 2.5 years. Assume the length of the membership in this club is normally distributed.
At a 0.05 level of significance, does this sample result suggest that the actual mean length of
membership in this club may be less than 8.7 years?
Solution:
Step 1 Ho : µ ≥ 8.7
H1 : µ < 8.7 [This is a left-tailed test]
Step 5
Since test statistic < -1.761, we reject Ho. The data seem to suggest that the
average membership length at this club is less than 8.7 years at α = 0.05.
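The test statistic for Example 10.3 can be computed with the sketch below. The standard library has no t distribution, so the critical value of -1.761 is taken from the decision above (df = 14, α = 0.05, left tail) and not computed; the function name t_statistic is our own.

```python
from math import sqrt


def t_statistic(x_bar, mu_0, s, n):
    """Test statistic for a mean when sigma is unknown and n is small."""
    return (x_bar - mu_0) / (s / sqrt(n))


# Example 10.3: n = 15, sample mean 7.2 years, s = 2.5 years, claimed mean 8.7
t = t_statistic(x_bar=7.2, mu_0=8.7, s=2.5, n=15)

T_CRITICAL = -1.761           # table value quoted in Step 5 of the example
reject = t < T_CRITICAL       # left-tailed decision rule

print(round(t, 2), reject)
```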
We may sometimes want to conduct a test of hypothesis about a population proportion. For
example, a company may claim that 99% of their orders are shipped on time. The quality
control department may want to check from time to time whether this claim is true.
Example 10.4
300 consumers were sampled and it was found that 37% used Brand A toothpaste. A similar
study conducted 5 years ago showed that 32% of the consumers used Brand A toothpaste. At a
10% significance level, is there evidence that there is a change in the proportion of consumers
using Brand A toothpaste?
Solution:
Step 1 Ho : π = 0.32
H1 : π ≠ 0.32 [This is a two-tailed test]
Step 2 α = 0.1 n = 300 p = 0.37 → Use Z
Step 3 $z = \dfrac{p - \pi}{\sqrt{\dfrac{\pi(1 - \pi)}{n}}} = \dfrac{0.37 - 0.32}{\sqrt{\dfrac{0.32(1 - 0.32)}{300}}} = 1.86$
Step 4 Critical value, z = ± 1.645
[Look for area =0.45 in Appendix 2 to obtain the critical z value]
Decision rule: Reject Ho if test statistic > 1.645 or test statistic < -1.645
Step 5
Since test statistic > 1.645, we reject Ho. There is evidence that the proportion
of consumers using Brand A toothpaste has changed at the α = 0.1 level of
significance.
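A Python sketch of the two-tailed test for a proportion in Example 10.4 is shown below. The function name z_test_proportion is our own. Note that the standard error uses the hypothesized proportion π, not the sample proportion p.

```python
from math import sqrt
from statistics import NormalDist


def z_test_proportion(p_sample, pi_0, n, alpha):
    """Two-tailed z test for a population proportion."""
    standard_error = sqrt(pi_0 * (1 - pi_0) / n)     # uses the hypothesized value
    z = (p_sample - pi_0) / standard_error
    critical = NormalDist().inv_cdf(1 - alpha / 2)   # alpha/2 in each tail
    return z, critical, abs(z) > critical


# Example 10.4: p = 0.37, hypothesized proportion 0.32, n = 300, alpha = 0.10
z, critical, reject = z_test_proportion(p_sample=0.37, pi_0=0.32, n=300, alpha=0.10)
print(round(z, 2), round(critical, 3), reject)
```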
When we make a decision, it would be nice if it were always a correct decision. This, however,
is statistically impossible, since we are making decisions on the basis of sample information.
The best we can hope for is to control the risk, or the chance with which an error occurs.
There are four possible outcomes in hypothesis testing, as shown in Table 10.1.
                        Actual Situation
Decision                H0 is true          H0 is false
Do not reject H0        Correct decision    Type II error
Reject H0               Type I error        Correct decision
Table 10.1
A correct decision occurs when Ho is true and we do not reject Ho. A correct decision also
occurs when Ho is false and our decision is to reject Ho.
A Type I error occurs when a true null hypothesis is rejected, that is, when the null hypothesis
is true but we decided against it. The probability assigned to Type I error is α, which is known
as the significance level of the test.
A Type II error occurs when we fail to reject a Ho that is actually false. The probability
assigned to Type II error is β. We can calculate the probability of a Type II error if and only
if the Ho is false and we know the true (actual) population value. The computations will not be
dealt with here.
There is an inverse relationship between α and β. Hence, a lower α is not necessarily better.
In our previous examples, the value of the significance level α is given. We then used the
sample test statistic to compare to the critical value for the given α to draw our conclusion.
In the probability-value approach, more commonly known as the p-value approach, the sample
test statistic will not be compared with the critical value. We will instead find the probability
of getting a sample statistic value as extreme as, or more extreme than, the sample statistic
value under study. Hence, the p-value may be defined as the smallest significance level (α) at
which the Ho is rejected.
The p-value will be compared with α to make a decision about the hypothesis.
For a one-tailed test, the p-value is given by the tail area beyond the value of the sample statistic.
See Figure 10.3.
For a two-tail test, the tail area beyond the value of the sample statistic is multiplied by 2.
See Figure 10.4
Example 10.5
The management of the sports club at a university claims that students use the gym about 10
times or more each year on average. To check this claim, a trainer at the gym took a sample of
36 students and found that the mean number of visits was 9.2. The population standard deviation
is known to be 2.4 visits. The trainer would like to test whether the mean number of times that
students use the gym is lower than what was claimed. Find the p-value for this test.
Solution:
n = 36 𝑥̅ = 9.2 σ = 2.4
As the trainer is testing whether the mean is lower than 10, this would be a left-tail test.
$z = \dfrac{\bar{x} - \mu}{\sigma / \sqrt{n}} = \dfrac{9.2 - 10}{2.4 / \sqrt{36}} = -2.00$
The area to the left of 𝑥̅ = 9.2 (or z < -2.00) is the p-value. From the standard normal table
(Appendix 2), we will get 0.4772, which is the area between the mean and z = -2.
p-value = 0.5 - 0.4772 = 0.0228
Suppose α is 1%: p-value = 0.0228 > 0.01, so Ho will not be rejected.
Suppose α is 5%: p-value = 0.0228 < 0.05, so Ho will be rejected.
Although the conclusion for the test depends on the significance level set, we can say that the
smaller the p-value, the higher the chance that Ho will be rejected.
We do not need to know the critical value when we use the p-value approach in hypothesis
testing.
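The p-value in Example 10.5 can also be obtained directly in Python, without a critical value. This is a minimal sketch using the standard library's NormalDist; the variable names are our own.

```python
from math import sqrt
from statistics import NormalDist

# Example 10.5: n = 36, sample mean 9.2, population sd 2.4, claimed mean 10
n, x_bar, sigma, mu_0 = 36, 9.2, 2.4, 10

z = (x_bar - mu_0) / (sigma / sqrt(n))   # test statistic, about -2.00
p_value = NormalDist().cdf(z)            # left-tail area below z

print(f"z = {z:.2f}, p-value = {p_value:.4f}")

# Compare the p-value with each significance level.
for alpha in (0.01, 0.05):
    decision = "reject Ho" if p_value < alpha else "do not reject Ho"
    print(f"alpha = {alpha}: {decision}")
```

For a two-tail test the left-tail area would be multiplied by 2 before comparing it with α.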
6. Discussion Questions
2. The mean life of a battery used in a digital clock is 305 days. The lives of the batteries
follow the normal distribution. The battery was recently modified to last longer. A
sample of 49 of the modified batteries had a mean life of 311 days with a standard
deviation of 12 days. Did the modification increase the mean life of the battery? Show
all steps in a hypothesis test with appropriate workings. Use 5% significance level.
3. The mean time required to perform a certain job on a factory floor is 20.5 days. A
random sample of 16 employees is taught using a new method to perform the job. After
training, the mean time taken by these 16 employees is 17 days with a standard deviation
of 5 days. Assume that the time required to perform the task is normally distributed.
Do the results provide sufficient evidence to indicate that the time taken to perform the
job has reduced under the new method? Use α = 0.01.
4. A firm has decided that it will market a new product only if at least 35% of the people
like it. A random sample of 400 persons shows that only 128 or 32% said they liked it.
Using α = 5%, can the firm conclude that the proportion of people liking the product is
less than 35%? Show all steps clearly.
5. With internet banking, BBS Bank believes that customers have reduced their number
of visits to the Bank to perform banking transactions. Susan was asked to test whether
the average number of visits to the bank by personal consumers has reduced from last
year’s mean of 20 visits. A sample survey of 25 customers revealed a mean of 18 visits
per year with a sample standard deviation of 3.5 visits.
7. Supplementary Questions
1. Meiru was hired as a server at Tea Garden Family Restaurant and was told that she
could receive an average of more than $20 a day in tips. Assume that the population
of daily tips follows the normal distribution with a standard deviation of $3.24. Over
the first 35 days she was employed at the restaurant, the mean daily amount of her tips
was $24.85. Using the 35 days as the sample, can Meiru conclude that she is earning
an average of more than $20 in tips at the 1% significance level?
3. SK-One claims that the mean potency strength of one of its weight reduction capsules
is at least 80. A random sample of 100 capsules was tested and produced a sample
mean potency strength of 78.5 with a standard deviation of 5.1.
(a) Do the data present sufficient evidence to reject SK-One’s claim at α = 0.05?
State your null and alternative hypotheses clearly.
(b) What is the chance of making a Type I error? What does a Type I error indicate
in this case?
4. A fresh orange juice vending machine is set to produce cups of fresh orange juice with
an average content of 190 ml. A random sample of 10 cups is drawn, and the average
was found to be 186 ml, with a standard deviation of 5 ml. The company would
like to determine whether this finding is consistent with the hypothesis that the
machinery is operating properly and producing drinks whose average content is 190ml.
Assume that the contents are approximately normally distributed.
5. A recent article in the Property Edge magazine reported that the 30-year mortgage rate
is now less than 6%. A sample of 8 small banks in the country revealed the following
30-year rates (in percent)
4.8 5.3 6.5 4.8 6.1 5.8 6.2 5.6
6. The Quikbite restaurant chain claims that the mean waiting time of customers for
service is 3 minutes with a population standard deviation of 1 minute. The quality
assurance department found in a sample of 50 customers that the mean waiting time
was 2.75 minutes.
(a) At the 0.05 significance level, can we conclude that the mean waiting time is
less than 3 minutes?
(b) What is the p-value? What is your decision regarding the null hypothesis based
on the p-value? Is this the same as the conclusion reached in part (a)?
8. A fast food outlet would like to prove that the mean sales of its burgers per week is
significantly lower than the overall average of $3,425 for all outlets combined. To do
so, this outlet sampled 40 weeks and found that the mean sales per week was $3,300.
The sample standard deviation was found to be $200.
(a) State the null and alternative hypotheses.
(b) At α = 0.02, can the outlet conclude that its mean sales are significantly lower than
$3,425?
10. When working properly, a machine that is used to make chips for mobile phones does
not produce more than 4% defective chips. Whenever the machine produces more than
4% defective chips, it needs an adjustment. To check if the machine is working
properly, the quality control department takes a random sample of 200 chips and found
14 defective chips.
Test at the 5% significance level whether or not the machine needs an adjustment.
BUSINESS STATISTICS
SESSION 11
___________________________________________________________________________
1. Introduction
In Session 10, we focused on testing the value of a particular population parameter such as a
mean, µ, or a proportion, π. We now look at another testing procedure, one which deals with
testing for association between two categorical variables.
In Session 7 (Linear Regression and Correlation), we dealt with whether two quantitative
variables are related to each other or are correlated. It was possible to measure the strength of
correlation between the 2 variables and also to determine to what extent a change in a variable
explains the change in the other variable.
Often, the variables that we deal with are not numeric variables. For example, we want to
know whether gender and preferred product brand are related. In this case, we will ask whether
there is an association between Gender and Brand. Or, we say “Are gender and brand
independent?” We will need frequency data in order to answer such questions. Thereafter, we
conduct a chi-square test of independence/association.
o It is positively skewed.
o It is non-negative.
o It is based on degrees of freedom.
o Whenever the degrees of freedom change, a new distribution is created. (see
Figure 11.1).
Figure 11.1 Chi-square distribution curves
The symbol χ is a Greek letter and is pronounced as “kye”. The values of a chi-square
distribution are denoted by the symbol χ², just as the values of the standard normal distribution
and the t distribution are denoted by z and t respectively.
4. Contingency Tables
A contingency table is also known as a cross-tabulation. The data in the table represent
frequencies (counts) where observations are organised in cross-tabulated categories.
Example 11.1
Students in non-business courses at a polytechnic were required to choose one of three elective
subjects. The following table show the subjects chosen and the gender of a sample of 300
students.
                      Elective Subject
Gender    Stock Investment   Market Research   Creative Arts   Total
Male             93                 70               12          175
Female           87                 32                6          125
Total           180                102               18          300
Table 11.1 Contingency Table
The cell frequencies are known as observed frequencies and show how the data are spread
across the different combinations of Gender and Elective Subjects. The size of the table is
determined by the number of row categories and number of column categories. Table 11.1 has
2 rows and 3 columns and is known as a 2 × 3 table.
The basic idea of chi-square test is to compare the observed frequencies which we have obtained
from sample data with a set of expected frequencies, conditional on the null hypothesis of no
association between the variables, that is, the variables are independent.
If the observed (actual) and expected (theoretical) frequencies are nearly alike (that is, we
observe only small differences), we can conclude that the two variables are not related. If these
frequencies differ substantially, then there is stronger evidence that the two variables are
related. However, to be more precise, we require a method of calculating this degree of
difference before we are able to draw a conclusion. The χ² statistic provides us the method to
do this.
To conduct a chi-square test based on the data in Table 11.1, we proceed as follows:
Solution:
Step 1: State the null and alternative hypotheses
In a chi-square test of independence, the null hypothesis must be that the two attributes are
independent or not related. Consequently, the alternative hypothesis is that the attributes are
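related. Referring to the contingency table (Table 11.1), we have
Ho: Gender and elective subject are independent (not related).
H1: Gender and elective subject are related.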
Step 2: Select the distribution to use and choose a significance level (α) to conduct the test.
We use the chi-square distribution to make a test of independence for a contingency table. We
shall use a 5% significance level.
Before the calculated value of chi-square can be found, we need to determine the expected
frequency (E) for each cell assuming the two variables are not related. The expected frequency
for each cell is calculated using:
$E = \dfrac{\text{Row total} \times \text{Column total}}{\text{Grand total}}$
                         Elective Subject
          Stock Investment    Market Research    Creative Arts
Gender      O      E            O      E           O      E        Total
Male       93   105.0 (1)      70   59.5 (2)      12   10.5 (3)     175
Female     87    75.0 (4)      32   42.5 (5)       6    7.5 (6)     125
Total     180   180           102  102            18   18           300
The bracketed numbers (1) to (6) label the six cells.
Table 11.2 Observed and Expected Frequencies
Using this formula, the expected frequencies (E) of the above six cells are calculated as follows:
Cell 1: (175 × 180) / 300 = 105.0     Cell 4: (125 × 180) / 300 = 75.0
Cell 2: (175 × 102) / 300 = 59.5      Cell 5: (125 × 102) / 300 = 42.5
Cell 3: (175 × 18) / 300 = 10.5       Cell 6: (125 × 18) / 300 = 7.5
$\chi^2 = \sum \left[ \dfrac{(O - E)^2}{E} \right]$
Step 4: Determine the χ² critical value and form the decision rule.
The chi-square test is always right-tailed, hence the rejection region falls on the right tail of the
chi-square distribution. The contingency table contains two rows (Male and Female) and three
columns (Stock investment, Market research and Creative arts). Note that we do not count the
row and column totals.
Degrees of freedom (df) = (Rows - 1)(Columns - 1) = (2 - 1)(3 - 1) = 2
From Appendix 4 (Chi-square distribution table), for df = 2 and α = 0.05, we have the critical
χ² = 5.991. (See Figure 11.2)
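The expected frequencies, the χ² statistic and the degrees of freedom for Table 11.1 can be computed with the Python sketch below. The function name chi_square_independence is our own. The value of the statistic printed by the sketch is its own arithmetic from the observed counts; the critical value 5.991 is the table value quoted above.

```python
def chi_square_independence(observed):
    """Expected frequencies, chi-square statistic and df for a contingency table."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)

    # E = (row total x column total) / grand total, for every cell
    expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

    # Sum of (O - E)^2 / E over all cells
    statistic = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return expected, statistic, df


# Table 11.1: rows are Male and Female; columns are the three elective subjects
observed = [[93, 70, 12], [87, 32, 6]]
expected, statistic, df = chi_square_independence(observed)

CRITICAL = 5.991  # Appendix 4, df = 2, alpha = 0.05
print(expected)
print(round(statistic, 3), df, statistic > CRITICAL)
```

Since the test is right-tailed, Ho is rejected whenever the computed statistic exceeds the critical value.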
The rejection region and non-rejection regions are shown in Figure 11.3
Figure 11.3 Rejection and Non-Rejection Regions
Example 11.2
A social scientist sampled 140 people to study the relationship between income level and lottery
playing.
                     Income Level
Lottery         Low   Middle   High   Total
Play             46     28      21      95
Did not Play     14     12      19      45
Total            60     40      40     140
Is it reasonable to conclude that playing lottery is related to income level? Use 0.01 significance
level.
Solution:
Ho: There is no relationship between income level and lottery playing
H1: There is a relationship between income level and lottery playing
α = 0.01
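Expected Frequencies (E)
                     Income
Lottery         Low     Middle   High    Total
Play            40.71   27.14    27.14     95
Did not Play    19.29   12.86    12.86     45
Total           60      40       40       140
$\chi^2 = \sum \left[ \dfrac{(O - E)^2}{E} \right]$
$= \dfrac{(46 - 40.71)^2}{40.71} + \dfrac{(28 - 27.14)^2}{27.14} + \dfrac{(21 - 27.14)^2}{27.14} + \dfrac{(14 - 19.29)^2}{19.29} + \dfrac{(12 - 12.86)^2}{12.86} + \dfrac{(19 - 12.86)^2}{12.86}$
= 6.544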
Degrees of freedom (df) = (Rows - 1)(Columns - 1)
= (2 - 1)(3 - 1) = 2
From Appendix 4, for df = 2 and α = 0.01, the critical χ² = 9.210.
Since χ² statistic < 9.210, we do not reject Ho. There is insufficient evidence of a relationship
between income level and lottery playing at α = 0.01.
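For the special case of 2 degrees of freedom, the chi-square distribution has a simple closed form: the area to the right of a value x is exp(-x/2). The Python sketch below uses this fact to reproduce the critical values from Appendix 4 and to find a p-value for Example 11.2. It applies to df = 2 only, and the function names are our own.

```python
from math import exp, log


def chi2_df2_critical(alpha):
    """Critical value of the chi-square distribution with df = 2."""
    # Solve exp(-x / 2) = alpha for x.
    return -2 * log(alpha)


def chi2_df2_p_value(statistic):
    """Right-tail area beyond the statistic, for df = 2 only."""
    return exp(-statistic / 2)


critical_01 = chi2_df2_critical(0.01)   # critical value for alpha = 0.01
critical_05 = chi2_df2_critical(0.05)   # critical value for alpha = 0.05
p_value = chi2_df2_p_value(6.544)       # statistic from Example 11.2

print(round(critical_01, 3), round(critical_05, 3), round(p_value, 4))
```

The p-value is larger than α = 0.01, so the p-value approach leads to the same decision as the critical value approach: Ho is not rejected.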
• The observed and expected frequencies shall be “sufficiently large” for the results to be
more reliable.
8. Discussion questions
Test at 0.01 level of significance whether there is any relationship between the
employment status and average weekly spending on cosmetics among women.
2. Jo Sport has a new design of spike shoes and wishes to determine whether there are any
differences in three media used in terms of exposure of an advertisement. The results
of the study are as follows
Media Used
Seen Ad? Television Magazine Internet Total
Yes 70 20 35 125
No 30 30 15 75
Total 100 50 50 200
3. SQA Tours has developed 2 different itineraries for tourist groups visiting Singapore for
3 days. The following table shows the itinerary chosen and the country of origin of these
tourist groups.
(a) At α = 0.05, is there sufficient evidence to conclude that the preferred itinerary
depends on country of origin? Do a chi-square test and show all steps clearly.
9. Supplementary questions
1. A sample of 150 students applying for a place in medical school were tested for
personality type. The following table gives the results of the survey:
Test at the 5% significance level if gender and personality type are related for all
students.
2. The table below shows a contingency table for a sample of 1104 randomly selected
adults from three types of environment (Urban, Suburban and Rural) and has been
classified into two groups by the level of exercise.
                 Level of exercise
Environment      High     Low     Total
Urban            221      256      477
Suburban         230      118      348
Rural            159      120      279
Total            610      494     1104
Test the hypothesis that there is no relationship between level of exercise and type of
environment and draw conclusions. (Use α = 1%)
3. In an online road safety research, many drivers admitted to unsafe road behaviour. The
age group and most frequent type of unsafe road behaviour are tabulated below:
Is there a relationship between age group and the most frequent type of unsafe road
behavior at the 1% significance level? Do a chi-square test and show all steps clearly.
(Express all computations to 1 decimal place)
4. Winn Electronics manufactures component parts for tablets and mobile phones. The
company has two machines that are used to make a component called CS1. From time to
time the quality controller at the company takes a sample of CS1 and checks them. A
recent check of 200 units of CS1 showed the following results:
Does the sample provide sufficient evidence to conclude that the two variables, the
machine type and the component quality (good or defective) are dependent? Do a chi-
square test using 1% significance level.
BUSINESS STATISTICS
SESSION 12
1. Introduction
The Analysis of Variance or ANOVA is a procedure used for comparing two or more groups
to establish if the means are equal. We will examine two or more independent samples to
determine if the population means could be equal with respect to only one factor. Hence, this
is known as a one-way ANOVA procedure.
Figure 12.1
The first chart in figure 12.1 shows that all three population means are equal. In the second
chart, not all means are equal.
The basic idea behind ANOVA is to analyse the variation in the data, both between and
within the k groups. From this analysis of the total variation, we are able to draw
conclusions about possible differences in the group means.
In ANOVA, we subdivide the total variation into that which is attributable to differences
between the k groups and that which is due to chance or random variation within the k groups.
“Within group” variation is considered experimental error, whereas “between group”
variation is attributable to treatment effects.
3. Assumptions
The F-distribution will be used to test whether the two or more population means are equal. A
one-way ANOVA is carried out under the following assumptions:
- Each sample is drawn from a normal or approximately normal population.
- The populations have equal variances (or standard deviations).
- The samples are randomly selected and are independent.
4. The F-distribution
Like the t and chi-square distributions, the shape of a particular F distribution curve depends
on the number of degrees of freedom. The characteristics of the F Distribution are:
- The F value is always non-negative.
- The F distribution is a family of continuous distributions. Each curve is skewed to the
right, but the skewness decreases as the number of degrees of freedom increases.
- It has two numbers for the degrees of freedom: the degrees of freedom for the numerator
and the degrees of freedom for the denominator.
5. Performing a One-way Analysis of Variance
We will now look into the procedure to carry out a one-way ANOVA to test whether the means
of two or more populations are equal.
The steps to carry out an ANOVA test are as follows:
1. State the null and alternative hypotheses.
2. Select a level of significance (α).
3. Calculate the value of the F test statistic.
4. Determine the F critical value and formulate the decision rule.
5. Arrive at a conclusion.
Example 12.1
A large company used three different training methods in orientating new marketing trainees
to their jobs. Upon completion of the training period, the training director chose 15 trainees,
who were randomly assigned to the three training methods. To compare the effectiveness of
these training methods, the researcher examined the quarterly sales (units) made by the 15
trainees. The results are shown in Table 12.1 below:
At the 0.05 level of significance, test whether the mean sales for each of the three training
methods are the same. Assume that all the assumptions required to apply the one-way ANOVA
procedure hold true.
Solution:
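To conduct the ANOVA test, we proceed as follows:
Step 1: State the null and alternative hypotheses.
H0 : µ1 = µ2 = µ3
H1 : The means are not all equal
Step 2: Select the distribution to use and choose a significance level (α) to conduct the test.
The F distribution is used, with α = 0.05.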
Step 3: Calculate the value of the test statistic (i.e. F statistic)
\( F = \dfrac{SST/(k-1)}{SSE/(n-k)} = \dfrac{MST}{MSE} \)
where SST is the treatment variation,
SSE is the variation due to the random or error component,
k = number of groups, and
n = total sample size.
The numerator is the Mean Square Treatment which is the mean variation between different
treatment groups.
The denominator is the Mean Square Error which is the mean variation within each treatment
group.
First, we compute the total variation (SS Total). Total variation refers to the sum of the squared
differences between each observation and the overall (grand) mean.
\( SS\ Total = \sum (X - \bar{X}_G)^2 \), where \( \bar{X}_G = \dfrac{\sum X}{n} \) is the overall (grand) mean.
\( \bar{X}_G = \dfrac{92+70+65+72+81+58+89+68+77+94+61+55+99+64+78}{15} = 74.87 \)
\( SS\ Total = (92 - 74.87)^2 + (70 - 74.87)^2 + (65 - 74.87)^2 + (72 - 74.87)^2 + (81 - 74.87)^2 \)
\( \quad + (58 - 74.87)^2 + (89 - 74.87)^2 + (68 - 74.87)^2 + (77 - 74.87)^2 + (94 - 74.87)^2 \)
\( \quad + (61 - 74.87)^2 + (55 - 74.87)^2 + (99 - 74.87)^2 + (64 - 74.87)^2 + (78 - 74.87)^2 \)
= 2659.73
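As a cross-check, the grand mean and the total variation can be reproduced with a short Python sketch. The three lists hold the 15 sales figures exactly as they appear in the calculation above, grouped five per training method; the variable names are illustrative assumptions and not part of the module.

```python
# Quarterly sales (units) of the 15 trainees, read from the calculation above.
method_1 = [92, 70, 65, 72, 81]
method_2 = [58, 89, 68, 77, 94]
method_3 = [61, 55, 99, 64, 78]

all_sales = method_1 + method_2 + method_3
n = len(all_sales)

# Grand mean: sum of all observations divided by the total sample size.
grand_mean = sum(all_sales) / n

# SS Total: squared deviations of every observation from the grand mean.
ss_total = sum((x - grand_mean) ** 2 for x in all_sales)

print(round(grand_mean, 2))  # 74.87
print(round(ss_total, 2))    # 2659.73
```

The sketch keeps the grand mean unrounded, so small differences in the last decimal place can arise if the hand calculation uses 74.87 throughout.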
Next, we compute the random variation. This is the sum of the squared differences between
each observation and its treatment mean.
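\( SSE = \sum (X - \bar{X}_C)^2 \)
where \( \bar{X}_C \) = column mean, that is, the sample mean of each group.
Sample mean for method 1, \( \bar{x}_1 = \dfrac{92+70+65+72+81}{5} = 76 \)
Sample mean for method 2, \( \bar{x}_2 = \dfrac{58+89+68+77+94}{5} = 77.2 \)
Sample mean for method 3, \( \bar{x}_3 = \dfrac{61+55+99+64+78}{5} = 71.4 \)
\( SSE = (92 - 76)^2 + (70 - 76)^2 + (65 - 76)^2 + (72 - 76)^2 + (81 - 76)^2 \)
\( \quad + (58 - 77.2)^2 + (89 - 77.2)^2 + (68 - 77.2)^2 + (77 - 77.2)^2 + (94 - 77.2)^2 \)
\( \quad + (61 - 71.4)^2 + (55 - 71.4)^2 + (99 - 71.4)^2 + (64 - 71.4)^2 + (78 - 71.4)^2 \)
= 2566
The treatment variation is the remainder: SST = SS Total – SSE = 2659.73 – 2566 = 93.73.
Hence \( F = \dfrac{93.73/(3-1)}{2566/(15-3)} = \dfrac{46.87}{213.83} = 0.219 \)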
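The remaining arithmetic for Example 12.1 (SSE, SST, the two mean squares and the F statistic) can be sketched in Python as follows. The data are the same 15 sales figures used in the calculation above; the dictionary layout and variable names are assumptions made for illustration only.

```python
# Sales figures grouped by training method, read from the calculation above.
groups = {
    "Method 1": [92, 70, 65, 72, 81],
    "Method 2": [58, 89, 68, 77, 94],
    "Method 3": [61, 55, 99, 64, 78],
}

k = len(groups)                                 # number of groups
n = sum(len(v) for v in groups.values())        # total sample size
grand_mean = sum(sum(v) for v in groups.values()) / n

# SSE: squared deviations of each observation from its own group mean.
sse = 0.0
for values in groups.values():
    group_mean = sum(values) / len(values)
    sse += sum((x - group_mean) ** 2 for x in values)

# SS Total, then SST as the part of the variation not due to error.
ss_total = sum((x - grand_mean) ** 2 for v in groups.values() for x in v)
sst = ss_total - sse

mst = sst / (k - 1)   # Mean Square Treatment
mse = sse / (n - k)   # Mean Square Error
f_stat = mst / mse

print(round(sse, 2), round(sst, 2), round(f_stat, 3))  # 2566.0 93.73 0.219
```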
Step 4: Determine the F critical value and formulate the decision rule.
The significance level is 0.05, which means the area in the right tail of the F distribution is 0.05.
Degrees of freedom (numerator) = k – 1 = 3 – 1 = 2
Degrees of freedom (denominator) = n – k = 15 – 3 = 12
Referring to Appendix 5 (F distribution), the critical value of F is 3.89 (see Figure 12.4).
Decision rule: Reject H0 if the F statistic > 3.89.
Figure 12.4 F-Distribution Table
Step 5: Arrive at a conclusion.
The value of the F test statistic, 0.219, is less than the critical F value of 3.89, so it falls
outside the rejection region (see Figure 12.5). Hence, we do not reject the null hypothesis and
conclude that there is insufficient evidence to show a difference in the mean sales among the
three methods at α = 0.05.
Next, click Tools -> Data Analysis -> Anova: Single Factor.
Statistical software packages such as SPSS and Minitab also provide output for the ANOVA
table using a similar format (see Table 12.2).
Example 12.2
Last year was a boom year for the stock market with stock prices for many industries hitting
historical highs. A study was carried out to find out whether the mean rates of return (in
percent) for stocks of companies in four industries are the same. The following results were
generated using Excel.
Solution (a):
A: df (numerator) = k – 1 = 4 -1 = 3
B: MSE = SSE/(n-k) = 35.7324/25 = 1.4293
Solution (b):
H0 : µproperty = µbanking = µmanufacturing = µtechnology
H1: The means are not all equal
Solution (c):
\( F = \dfrac{SST/(k-1)}{SSE/(n-k)} = \dfrac{42.1592/(4-1)}{35.7324/(29-4)} = 9.832 \)
Since the F statistic (9.832) > 4.68, reject H0. At α = 0.01, there is a difference in the
mean rates of return for companies in the four industries.
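The calculation in Solution (c) can be checked with a minimal Python sketch. The sums of squares, k and n are the values quoted in the solutions above, and the critical value 4.68 is the one used in the conclusion; the variable names are hypothetical.

```python
# Values quoted in the solutions to Example 12.2.
sst = 42.1592      # between-groups (treatment) sum of squares
sse = 35.7324      # within-groups (error) sum of squares
k, n = 4, 29       # number of industries, total sample size

mst = sst / (k - 1)    # Mean Square Treatment
mse = sse / (n - k)    # Mean Square Error
f_stat = mst / mse

f_critical = 4.68      # critical F used above (alpha = 0.01, df = 3 and 25)
reject_h0 = f_stat > f_critical

print(round(mse, 4), round(f_stat, 3), reject_h0)  # 1.4293 9.832 True
```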
7. Discussion questions
1. A physician who specializes in weight control has three different diets he recommends.
He randomly selected 15 patients and then assigned 5 to each type of diet. After six
months, the following weight loss in kilograms were noted.
At the 0.05 significance level, can he conclude that there is a difference in the mean
amount of weight loss among the three diet types? Show all steps clearly.
2. A factory owner would like to determine whether there is a difference between the mean
numbers of breakdowns in three factories at different locations. He recorded the number
of breakdowns in each factory for a sample of 8 days.
The owner performed an ANOVA analysis and produced the following output:
Groups       Count   Sum   Average   Variance
Tuas         8       27    3.375     1.696429
Tampines     8       18    2.25      2.5
Sembawang    8       29    3.625     2.553571

Source of Variation   SS      df   MS   F
Between Groups        8.58    2
Within Groups         47.25   21
Total                 55.83   23
At the 0.01 significance level, can the factory owner conclude that there is a difference
in the mean number of breakdowns?
3. Amazing Tours wants to determine whether the average expenditure of the tourists
from China, Japan, India and Korea are different. A sample of 7 tourists was taken
from each of the 4 countries.
4. Roger wishes to find out whether there were differences in the mean satisfaction ratings
(1=lowest and 10=highest) for the 3 flat types that he renovated. Results from a sample
of clients are shown below:
Source of Variation SS df MS F
Between Groups A 2 2.1119 C
Within Groups 29.9761 17 B
Total 34.2000 19
(a) State the null and alternative hypotheses.
(b) Fill in the missing values labelled “A”, “B” and “C”
(c) Using α = 0.05, test whether there were significant differences in the mean
satisfaction ratings amongst the clients of different flat types. Show all steps
clearly.
8. Supplementary questions
1. The following ANOVA table, based on information obtained for four samples selected
from four independent populations that are normally distributed with equal variances,
has a few missing values.
Source of Variation SS df MS F
Between Groups
4.07
Within Groups 15 9.2154
Total 18
(a) Find the missing values and complete the ANOVA table.
(b) Using α = 0.05, what is your conclusion for the test? State clearly your null
and alternative hypotheses.
2. Given the following table of information, use a significance level of 0.01 to test
whether the treatment means are equal.
3. AA Cooker is one of the most popular brands of rice cooker and the cookers are being
sold at various outlets of supermarkets. To further improve its marketing strategies, the
cookers are test-marketed by having displays placed at different areas of the
supermarket. The table below shows the number of cookers successfully sold at five
different locations in the supermarket during three randomly selected days.
Using a significance level of 0.01, state whether there are any differences in the mean
number of cookers sold at the five different locations.
(a) State the null and alternative hypotheses.
(b) State the decision rule.
(c) Given the following partial ANOVA table, draw your conclusions.
Source of Variation   SS         df   MS   F
Treatments            155.6      4
Error                 17.33333   10
Total                 172.9333   14
You helped to analyse the data using Excel and obtained the following results.
ANOVA
Source of Variation SS df MS F
Between Groups 2 67
Within Groups 251
Total
(a) Which training programme shows the highest average sales figure amongst the
staff?
(b) Formulate the null and alternative hypotheses.
(c) Do the data provide sufficient evidence to indicate a difference in mean sales
for staff trained under the three programmes? (Assume α = 5%.)
5. A one-way ANOVA analysis was conducted to explore whether there were significant
differences in the mean credit card spending (in $) for 3 types of credit cards that ABC
Bank issues. Partial results of the analysis are shown in the tables below:
ANOVA
Source of Variation SS df MS F
Between Groups 133800.1583
Within Groups 3168720.042
Total 3302520.2
(a) How many groups (k) were surveyed and what is the total sample size(n)?
(b) State the null and alternative hypotheses.
(c) Compute the F statistic.
(d) At the 1% significance level, is there a difference in the mean credit card
spending for the three types of cards? Show all steps clearly.
MOCK EXAM PRACTICE PAPER
Section A
Question 1
(a) A marketer wants to obtain feedback for the design of a new product packaging. Five
designs are being considered and respondents were asked to rank their preferences for
these designs (5= Most Preferred and 1= Least Preferred). The sample of respondents
was obtained by interviewing every 20th shopper who walks into a particular store.
(b) An interior designer has secured several jobs from a newly completed Build-to-Order
(BTO) project. His clients have varying budgets. The following table shows the
frequency distribution of the renovation budgets from a sample of 20 clients.
Budget ($000) Frequency
5 up to 20 3
20 up to 35 7
35 up to 50 5
50 up to 65 4
65 up to 80 1
(i) Compute the mean and standard deviation. (Express your answers up to 2
decimal places)
(iv) We should always use the mean as a measure of central tendency. Discuss this
statement.
(c) The number of traffic violations due to speeding is on the rise. The table below shows
data collected from 500 drivers on whether they had speeding violations over the past
three months and their car type.
                          Speeding violation
Car Type                  Yes    No     Total
Small cars (< 2500 cc)    50     150    200
Big cars (>= 2500 cc)     75     225    300
Total                     125    375    500
(i) Find the probability that a randomly selected driver has a speeding violation.
(ii) Given that a randomly selected driver drives a small car, what is the probability
that the person does not have a speeding violation?
(iii) Find the probability that a randomly selected driver has a speeding violation and
drives a big car.
(iv) Find the probability that a randomly selected driver does not have a speeding
violation or does not drive a big car.
Question 2
(a) What is the purpose of drawing a scatter diagram when using correlation and regression
analysis?
(b) The supervisor at Hoe’s factory has collected data on a random sample of 8 workers
about the time workers spent on attending training programmes (in hours) and the
number of units manufactured per day.
(iv) Estimate the number of units manufactured by a worker who spends 42 hours
on training programmes. Comment on the likely accuracy of the estimate.
(c) A car dealer wanted to investigate how the price of one of its car models decreases with
age. The research department took a sample of eight cars and collected the following
information on the age and selling price of these cars. The data are shown below:
Age of car (years) 8 3 6 9 2 5 6 3
Selling Price ($000) 30 70 26 34 85 36 40 69
Regression Statistics
Multiple R           0.867335
R Square             0.75227
Adjusted R Square    0.710982
Standard Error       12.023763
Observations         8

                     Coefficients   Standard Error   t Stat      P-value
Intercept            89.603448      10.472557        8.556023    0.000139
Age of car (years)   -7.781609      1.823038         -4.268483   0.005271
(iii) Determine the coefficient of correlation and interpret the value. (Express answer
to 3 decimal places).
Section B
Question 3
(a) A random sample of 120 small retail outlets showed that 75 of the firms use cashless
payments.
(i) Calculate a point estimate for the proportion of all small retail outlets
that use cashless payments.
(ii) Construct a 95% confidence interval to estimate the proportion of all small
retail outlets that use cashless payments.
(b) A property agent is interested to find out the prices of private apartments located at
Changi. He sampled 70 properties and found the mean and standard deviation to be
$940,000 and $133,000 respectively. Compute a 90% confidence interval for the true
mean price of private apartments located at Changi.
(c) For each of the following changes, state whether confidence interval for µ will become
wider or narrower:
(ii) At the 1% significance level, is there evidence to conclude that the mean daily
revenue is less than what the owner claimed?
Question 4
(a) In recent years, there has been concern that in-patient hospital charges at government
hospitals differ widely. To find out whether the average hospital bills differ significantly,
data were collected from patients who stayed 5 days at four government hospitals for a
similar medical condition.
ANOVA
Source of Variation   SS            df   MS            F
Between Groups        17411832.3    3    5803944.099   ?
Within Groups         33291705.86   22   1513259.357
Total                 50703538.15   25
Examine the above Excel output which attempts to analyse whether the average bills
incurred at four hospitals are the same.
(ii) At the 1% significance level, is there a significant difference in the mean hospital
bill for the four hospitals? Show all steps of the ANOVA test clearly.
(b) The average 5-day hospital bill at AA hospital is known to be normally distributed with
mean $6800 and a standard deviation of $1200.
(ii) Suppose 10% of the patients’ bills are $X or less. Find the value of X.
(c) Cantone Group has three similar restaurants located at different parts of Singapore.
The restaurant management has commissioned a survey to determine customers’
satisfaction on the quality of food at each of the three restaurants. A random sample
of 100 customers was selected from each restaurant.
The results of the survey are shown in the following table:
               Overall Satisfaction Rating
               Excellent   Average   Below Average   TOTAL
Restaurant 1   59          32        9               100
Restaurant 2   48          44        8               100
Restaurant 3   64          26        10              100
TOTAL          171         102       27              300
MOCK PRACTICE PAPER (ANSWERS)
Section A
Question 1
(a)(i) Discrete
(ii) Ordinal scale
(iii) Systematic sampling
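(b)(i)
\( n = \sum f = 20 \), \( \sum fx = 745 \), \( \sum f(x - \bar{x})^2 = 5523.75 \)
\( \bar{x} = \dfrac{\sum fx}{n} = \dfrac{745}{20} = 37.25 \), i.e. $37,250
\( s = \sqrt{\dfrac{\sum f(x - \bar{x})^2}{n - 1}} = \sqrt{\dfrac{5523.75}{20 - 1}} = 17.051 \), i.e. $17,051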
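The grouped-data mean and standard deviation can be reproduced with the sketch below. The class limits and frequencies come from the budget table in Question 1(b); using the class midpoint to represent each class is the usual grouped-data convention and is assumed here. The variable names are illustrative.

```python
import math

# (lower limit, upper limit, frequency) in $000, from the budget table in Question 1(b).
classes = [(5, 20, 3), (20, 35, 7), (35, 50, 5), (50, 65, 4), (65, 80, 1)]

midpoints = [(lower + upper) / 2 for lower, upper, _ in classes]
freqs = [f for _, _, f in classes]

n = sum(freqs)
sum_fx = sum(f * x for f, x in zip(freqs, midpoints))
mean = sum_fx / n

# Sum of f(x - mean)^2, then the sample standard deviation with n - 1.
sum_f_dev2 = sum(f * (x - mean) ** 2 for f, x in zip(freqs, midpoints))
sd = math.sqrt(sum_f_dev2 / (n - 1))

print(n, sum_fx, mean)                   # 20 745.0 37.25
print(sum_f_dev2, round(sd, 3))          # 5523.75 17.051
```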
(iii)
[Chart: No of clients against Budget ($000), plotted at class midpoints 12.5, 27.5, 42.5, 57.5 and 72.5]
(iv) The mean is a familiar concept and simple to understand and compute.
However, it is affected by extreme values. In such situations, the median may be
preferred.
(c)(iv) P(No ∪ Not big car) = P(No) + P(Small car) – P(No ∩ Small car)
= 375/500 + 200/500 – 150/500 = 425/500 = 0.85
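The addition rule used above can be verified directly from the counts in the speeding-violation table. The sketch below uses exact fractions; the dictionary keys are labels chosen for illustration.

```python
from fractions import Fraction

# Counts from the speeding-violation table: (car type, violation) -> drivers.
counts = {
    ("small", "yes"): 50, ("small", "no"): 150,
    ("big", "yes"): 75,   ("big", "no"): 225,
}
total = sum(counts.values())

p_no = Fraction(counts[("small", "no")] + counts[("big", "no")], total)
p_small = Fraction(counts[("small", "yes")] + counts[("small", "no")], total)
p_no_and_small = Fraction(counts[("small", "no")], total)

# General addition rule: P(A or B) = P(A) + P(B) - P(A and B).
p_no_or_small = p_no + p_small - p_no_and_small

print(p_no_or_small, float(p_no_or_small))  # 17/20 0.85
```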
Question 2
(a) A scatter diagram provides an indication whether any relationship exists between the
two variables under study.
(b)(i)
(ii)
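\( b = \dfrac{n\sum XY - \sum X \sum Y}{n\sum X^2 - (\sum X)^2} = \dfrac{8(1270) - (168)(56)}{8(4034) - (168)^2} = 0.18577 \)
\( a = \bar{Y} - b\bar{X} = \dfrac{56}{8} - 0.18577\left(\dfrac{168}{8}\right) = 3.0988 \)
Regression equation: \( \hat{Y} = 3.0988 + 0.1858X \)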
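The slope and intercept can be recomputed from the summary totals that appear in the working above (n = 8, ΣX = 168, ΣY = 56, ΣXY = 1270, ΣX² = 4034). The sketch below assumes those totals are as read from the working; the `predict` helper is a hypothetical name added for illustration.

```python
# Summary totals as they appear in the working above.
n = 8
sum_x, sum_y = 168, 56        # training hours, units manufactured
sum_xy, sum_x2 = 1270, 4034

# Least-squares slope and intercept.
b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
a = sum_y / n - b * (sum_x / n)

def predict(hours):
    """Estimated units manufactured per day for a given number of training hours."""
    return a + b * hours

print(round(b, 5), round(a, 4))   # 0.18577 3.0988
print(round(predict(42), 1))      # 10.9
```

The last line applies the fitted equation to the 42 training hours asked about in part (b)(iv); whether such an estimate is reliable depends on whether 42 hours lies within the range of the sample data.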
(iii) b= 0.1858
b is positive. This indicates a direct relationship between the number of training hours
and the number of units manufactured. For every one-hour increase in training time, the
number of units manufactured is expected to increase by 0.186 units.
(c)(i) Independent variable (X) : Age of car (years)
Dependent variable (Y) : Price ($000)
(iii) r = - 0.867
The value shows a strong, negative (inverse) relationship between age of car and the
selling price.
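The value of r can be checked against the raw data given in Question 2(c). The sketch below applies the correlation coefficient formula from Appendix 1 to the eight (age, price) pairs; the variable names are illustrative.

```python
import math

# Data from Question 2(c): age of car (years) and selling price ($000).
age = [8, 3, 6, 9, 2, 5, 6, 3]
price = [30, 70, 26, 34, 85, 36, 40, 69]

n = len(age)
sum_x, sum_y = sum(age), sum(price)
sum_xy = sum(x * y for x, y in zip(age, price))
sum_x2 = sum(x * x for x in age)
sum_y2 = sum(y * y for y in price)

# Correlation coefficient from the raw sums.
r = (n * sum_xy - sum_x * sum_y) / math.sqrt(
    (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
)

print(round(r, 3))       # -0.867
print(round(r ** 2, 3))  # 0.752
```

The sign of r matches the negative slope in the Excel output, and r squared agrees with the reported R Square of 0.75227.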
Question 3
(a)(i) Sample proportion, \( p = \dfrac{75}{120} = 0.625 \)
Point estimate of π = p = 0.625
(c)(i) wider
(ii) narrower
(iii) wider
Question 4
(a)(i) H0 : µAA = µCC = µSS = µNN
H1: The means are not all equal
Since F statistic < 4.82, do not reject H0. There is insufficient evidence to conclude any
difference in the mean hospital bills for the 4 hospitals at α = 0.01.
(b)(i) \( P(x > 8500) = P\left(z > \dfrac{8500 - 6800}{1200}\right) \)
= P(z > 1.42)
= 0.5 – 0.4222 = 0.0778
(ii) X = $5,264
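Both normal-distribution answers can be checked with Python's standard library. The sketch below rounds z to two decimal places to mimic a lookup in the standard normal table (Appendix 2); that rounding is an assumption made so that the output matches the table-based working.

```python
from statistics import NormalDist

mean, sd = 6800, 1200          # 5-day bill at AA hospital, from Question 4(b)
std_normal = NormalDist()      # standard normal: mean 0, standard deviation 1

# (i) P(x > 8500), with z rounded to 2 decimal places as in a table lookup.
z = round((8500 - mean) / sd, 2)
p_above = 1 - std_normal.cdf(z)

# (ii) Value X such that 10% of bills are X or less.
z_10 = round(std_normal.inv_cdf(0.10), 2)
x_10 = mean + z_10 * sd

print(z, round(p_above, 4))    # 1.42 0.0778
print(z_10, round(x_10))       # -1.28 5264
```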
Since χ² < 9.488, do not reject H0. There is insufficient evidence to conclude a
relationship between restaurant and overall satisfaction rating at α = 5%.
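The chi-square statistic behind this conclusion is not shown above, so the following sketch computes it from the survey table in Question 4(c). Expected frequencies are taken as (row total × column total) / grand total, the usual rule for a test of independence; the variable names are illustrative, and the critical value 9.488 is the one quoted in the answer.

```python
# Observed counts from Question 4(c): rows are Restaurants 1 to 3,
# columns are Excellent, Average, Below Average.
observed = [
    [59, 32, 9],
    [48, 44, 8],
    [64, 26, 10],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Chi-square statistic: sum of (O - E)^2 / E over all cells.
chi_square = 0.0
for i, row in enumerate(observed):
    for j, o in enumerate(row):
        e = row_totals[i] * col_totals[j] / grand_total
        chi_square += (o - e) ** 2 / e

df = (len(observed) - 1) * (len(observed[0]) - 1)

print(df, round(chi_square, 3))   # 4 7.514
print(chi_square > 9.488)         # False
```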
ANSWERS TO DISCUSSION AND SUPPLEMENTARY QUESTIONS
Session 2: Discussion questions
(d)
2.
3(a)
(b)
(b)
5(a)
[Chart: frequency distribution of Amount Spent ($), class midpoints 1500 to 19500]
Positively skewed.
7(a)
(b) 18%
(c)
Session 2 Supplementary
(d)
2(a)
3(a) 4 (b) 6
(c)
4(a) (b)
6(a) 62 (b) 72
(c)
(d)
(e) 12 days
7(a)
(d)
8.
Positive relationship
2(a) 7 (b) 0
(c) No.
5 2.33; 1.53
1. 21.5; 22.5; 21
3. 41 approximately
2(a) 0.2
(b)
4. 6
6. 0.565
7(a)
1. 0.75; 0.9375
2. 0.1
4(a) 0.25
(b) 0.25
6. 0.5625
8(a) 0.0158 (b) 0.716
12. 120
13. 10
14. 1320
15. 36
1(a)
[Scatter diagram: Advertising vs Sales; horizontal axis Advertising ($m), vertical axis Sales ($m)]
Session 7: Supplementary questions
1(a) \( \hat{Y} = 25 + 7X \)
(b) Sales increase $7,000 for every additional sales call made.
(c) $67,000
(d) 0.996. Strong, direct relationship.
2(a)
[Scatter diagram: Income vs Food Expenditure; horizontal axis Income ($00), vertical axis Food expenditure ($00)]
5(a) Independent : flight hours Dependent : Airfare ($)
(b) \( \hat{Y} = -20.423 + 131.401x \) (c) 0.818. Strong, positive relationship.
(d) 0.668 About 66.8% of variation in airfare is explained by the flight hours.
(e) $636.58
(f) The airfare will cost $262.80 more. (g) No. Extrapolation.
(c) -0.302. Number of demerit points decreases by 0.302 for every additional year of
driving experience.
(d) r = - 0.788 Strong, inverse relationship.
(e) 6 demerit points more. (f) X = 4 (whole number)
2. 51 months
5. 0.0667
9. 83.4 marks
3. (0.641; 0.739)
5. 107
7(a) (8.25; 9.35) transactions (b) No, it falls outside the interval.
(c) 119
5(a) 3.847 kg
(b) (76.62; 86.18) kg Fertiliser makes a difference to the crop yield.
7. (0.538; 0.762)
1(a) H0 : µ = 16 H1 : µ ≠ 16
(b) Test statistic: z = 0.8; Critical value: z = ±1.96; do not reject H0.
5(a) H0 : µ ≥ 20 H1 : µ < 20
(b) Test statistic: t = -2.86; Critical value: t = -1.711; reject H0.
(c) Probability of Type I error = α = 0.05. This is the probability of rejecting a true H0.
1. H0 : µ ≤ 20 H1 : µ > 20
Test statistic: z = 8.86; Critical value: z = 2.33; reject H0.
3(a) H0 : µ ≥ 80 H1 : µ < 80
Test statistic: z = -2.94; Critical value: z = -1.645; reject H0.
(b) Probability of Type I error = α = 0.05. This is the probability of rejecting H0 when it
is true.
4(a) H0 : µ = 190 H1 : µ ≠ 190
(b) Test statistic: t = -2.53; Critical value: t = ±2.262; reject H0.
6(a) H0 : µ ≥ 3 H1 : µ < 3
Test statistic: z = -1.77; Critical value: z = -1.645; reject H0.
(b) p-value = 0.038, which is < α = 0.05; reject H0.
7(a) H0 : µ ≥ 20 H1 : µ < 20
(b) Reject H0 if the test statistic < −1.645
(c) Test statistic: z = -3.68
(d) Reject H0.
(e) The mean amount is less than 20 ounces, at the 0.05 level of significance.
(f) p-value = 0.0001, which is < α = 0.05; reject H0.
Session 12: Discussion questions
1(a)
2(a) H0 : µ1 = µ2 = µ3
H1 : Not all means are equal.
(b) Reject H0 if F statistic > 6.93.
(c) SS Total = 120.9333; SSE = 107.2; SST = 13.7333
(d)
(e) Test statistic: F = 0.769; Critical value: F = 6.93; do not reject H0.
3(a) H0 : µ1 = µ2 = µ3 = µ4 = µ5
H1 : The mean sales are not all equal.
(b) Reject H0 if F statistic > 5.99.
(c) Test statistic: F = 22.4423; Critical value: F = 5.99; reject H0.
5(a) k=3; n=20
(b) H0 : µladies = µrevolution = µplatinum
H1: The means are not all equal
(c) F statistic = 0.359
(d) Test statistic: F = 0.359; Critical value: F = 6.11; do not reject H0.
Appendix 1_1
Key Formulas
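Population mean: \( \mu = \dfrac{\sum X}{N} \)
Sample mean: \( \bar{x} = \dfrac{\sum X}{n} \)
Population variance: \( \sigma^2 = \dfrac{\sum (X - \mu)^2}{N} \)
Population standard deviation: \( \sigma = \sqrt{\dfrac{\sum (X - \mu)^2}{N}} \)
Sample variance: \( s^2 = \dfrac{\sum (X - \bar{x})^2}{n - 1} \)
Sample standard deviation: \( s = \sqrt{\dfrac{\sum (X - \bar{x})^2}{n - 1}} \)
Sample mean, grouped data: \( \bar{x} = \dfrac{\sum fx}{n} \)
Sample standard deviation, grouped data: \( s = \sqrt{\dfrac{\sum f(x - \bar{x})^2}{n - 1}} \)
Bayes’ Theorem: \( P(A_1|B) = \dfrac{P(A_1) \cdot P(B|A_1)}{P(A_1) \cdot P(B|A_1) + P(A_2) \cdot P(B|A_2)} \)
Number of permutations: \( {}_nP_r = \dfrac{n!}{(n-r)!} \)
Number of combinations: \( {}_nC_r = \dfrac{n!}{r!(n-r)!} \)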
Appendix 1_2
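Standard normal value: \( z = \dfrac{X - \mu}{\sigma} \)
\( z = \dfrac{\bar{x} - \mu}{\sigma/\sqrt{n}} \)
Standard error of mean: \( \sigma_{\bar{x}} = \dfrac{\sigma}{\sqrt{n}} \)
Confidence interval for µ: \( \bar{x} \pm z\dfrac{\sigma}{\sqrt{n}} \)
\( \bar{x} \pm z\dfrac{s}{\sqrt{n}} \)
\( \bar{x} \pm t\dfrac{s}{\sqrt{n}} \) with df = n – 1
Sample proportion: \( p = \dfrac{X}{n} \)
Confidence interval for proportion: \( p \pm z\sqrt{\dfrac{p(1-p)}{n}} \)
Sample size for estimating mean: \( n = \left[\dfrac{z\sigma}{E}\right]^2 \)
Sample size for proportion: \( n = p(1-p)\left[\dfrac{z}{E}\right]^2 \)
Testing of hypothesis, one mean: \( z = \dfrac{\bar{x} - \mu}{\sigma/\sqrt{n}} \)
\( z = \dfrac{\bar{x} - \mu}{s/\sqrt{n}} \)
\( t = \dfrac{\bar{x} - \mu}{s/\sqrt{n}} \)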
Appendix 1_3
Test of hypothesis, one proportion: \( z = \dfrac{p - \pi}{\sqrt{\dfrac{\pi(1-\pi)}{n}}} \)
Sum of Squares, Total: \( SS\ Total = \sum (X - \bar{X}_G)^2 \)
Sum of Squares, Error: \( SSE = \sum (X - \bar{X}_C)^2 \)
F statistic: \( F = \dfrac{SST/(k-1)}{SSE/(n-k)} \)
df (numerator) = k – 1
df (denominator) = n – k
Chi-Square statistic: \( \chi^2 = \sum \left[\dfrac{(O - E)^2}{E}\right] \)
df = (r – 1)(c – 1)
Correlation coefficient: \( r = \dfrac{n\sum XY - \sum X \sum Y}{\sqrt{[n\sum X^2 - (\sum X)^2][n\sum Y^2 - (\sum Y)^2]}} \)
Coefficient of determination: \( r^2 = \dfrac{a\sum Y + b\sum XY - n\bar{Y}^2}{\sum Y^2 - n\bar{Y}^2} \)
Appendix 2
AREA UNDER THE NORMAL CURVE
Area of 0.4750
0 Z = 1.96
___________________________________________________________________________
Example: To find the area under the curve between the mean and a point 1.96 standard deviations to
the right of the mean, look up the value opposite 1.9 and under 0.06 in the table; 0.4750 of the area
under the curve lies between the mean and a z value of 1.96.
___________________________________________________________________________
z 0.00 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09
0.0 0.0000 0.0040 0.0080 0.0120 0.0160 0.0199 0.0239 0.0279 0.0319 0.0359
0.1 0.0398 0.0438 0.0478 0.0517 0.0557 0.0596 0.0636 0.0675 0.0714 0.0753
0.2 0.0793 0.0832 0.0871 0.0910 0.0948 0.0987 0.1026 0.1064 0.1103 0.1141
0.3 0.1179 0.1217 0.1255 0.1293 0.1331 0.1368 0.1406 0.1443 0.1480 0.1517
0.4 0.1554 0.1591 0.1628 0.1664 0.1700 0.1736 0.1772 0.1808 0.1844 0.1879
0.5 0.1915 0.1950 0.1985 0.2019 0.2054 0.2088 0.2123 0.2157 0.2190 0.2224
0.6 0.2257 0.2291 0.2324 0.2357 0.2389 0.2422 0.2454 0.2486 0.2517 0.2549
0.7 0.2580 0.2611 0.2642 0.2673 0.2704 0.2734 0.2764 0.2794 0.2823 0.2852
0.8 0.2881 0.2910 0.2939 0.2967 0.2995 0.3023 0.3051 0.3078 0.3106 0.3133
0.9 0.3159 0.3186 0.3212 0.3238 0.3264 0.3289 0.3315 0.3340 0.3365 0.3389
1.0 0.3413 0.3438 0.3461 0.3485 0.3508 0.3531 0.3554 0.3577 0.3599 0.3621
1.1 0.3643 0.3665 0.3686 0.3708 0.3729 0.3749 0.3770 0.3790 0.3810 0.3830
1.2 0.3849 0.3869 0.3888 0.3907 0.3925 0.3944 0.3962 0.3980 0.3997 0.4015
1.3 0.4032 0.4049 0.4066 0.4082 0.4099 0.4115 0.4131 0.4147 0.4162 0.4177
1.4 0.4192 0.4207 0.4222 0.4236 0.4251 0.4265 0.4279 0.4292 0.4306 0.4319
1.5 0.4332 0.4345 0.4357 0.4370 0.4382 0.4394 0.4406 0.4418 0.4429 0.4441
1.6 0.4452 0.4463 0.4474 0.4484 0.4495 * 0.4505 0.4515 0.4525 0.4535 0.4545
1.7 0.4554 0.4564 0.4573 0.4582 0.4591 0.4599 0.4608 0.4616 0.4625 0.4633
1.8 0.4641 0.4649 0.4656 0.4664 0.4671 0.4678 0.4686 0.4693 0.4699 0.4706
1.9 0.4713 0.4719 0.4726 0.4732 0.4738 0.4744 0.4750 0.4756 0.4761 0.4767
2.0 0.4772 0.4778 0.4783 0.4788 0.4793 0.4798 0.4803 0.4808 0.4812 0.4817
2.1 0.4821 0.4826 0.4830 0.4834 0.4838 0.4842 0.4846 0.4850 0.4854 0.4857
2.2 0.4861 0.4864 0.4868 0.4871 0.4875 0.4878 0.4881 0.4884 0.4887 0.4890
2.3 0.4893 0.4896 0.4898 0.4901 0.4904 0.4906 0.4909 0.4911 0.4913 0.4916
2.4 0.4918 0.4920 0.4922 0.4925 0.4927 0.4929 0.4931 0.4932 0.4934 0.4936
2.5 0.4938 0.4940 0.4941 0.4943 0.4945 0.4946 0.4948 0.4949 * 0.4951 0.4952
2.6 0.4953 0.4955 0.4956 0.4957 0.4959 0.4960 0.4961 0.4962 0.4963 0.4964
2.7 0.4965 0.4966 0.4967 0.4968 0.4969 0.4970 0.4971 0.4972 0.4973 0.4974
2.8 0.4974 0.4975 0.4976 0.4977 0.4977 0.4978 0.4979 0.4979 0.4980 0.4981
2.9 0.4981 0.4982 0.4982 0.4983 0.4984 0.4984 0.4985 0.4985 0.4986 0.4986
3.0 0.4987 0.4987 0.4987 0.4988 0.4988 0.4989 0.4989 0.4989 0.4990 0.4990
NOTE: For values of z above 3.09, use 0.4999 for the area.
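Entries in this table can be cross-checked in Python. The table gives the area between the mean and z, which is the cumulative probability up to z minus 0.5; the helper name `area_mean_to_z` is a hypothetical one chosen for this sketch.

```python
from statistics import NormalDist

def area_mean_to_z(z):
    """Area under the standard normal curve between the mean (0) and z."""
    return NormalDist().cdf(z) - 0.5

# The worked example above: z = 1.96 gives an area of 0.4750.
print(round(area_mean_to_z(1.96), 4))  # 0.475
# Row 1.4, column 0.02 of the table.
print(round(area_mean_to_z(1.42), 4))  # 0.4222
```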
Appendix 3
t-DISTRIBUTION
Left tail Right Two
tail tails
Appendix 4
Appendix 5_1
Values of F for F Distributions with .05 of the Area in the Right Tail
EXAMPLE: For a test of a significance of .05 where we have 15 degrees of freedom for the numerator and 6 degrees of freedom
for the denominator, the appropriate F value is found by looking under the 15 degrees of freedom column and proceeding down
to the 6 degrees of freedom row; there we find the appropriate F value to be 3.94
11 4.84 3.98 3.59 3.36 3.20 3.09 3.01 2.95 2.90 2.85 2.79 2.72 2.65 2.61 2.57 2.53 2.49 2.45 2.40
12 4.75 3.89 3.49 3.26 3.11 3.00 2.91 2.85 2.80 2.75 2.69 2.62 2.54 2.51 2.47 2.43 2.38 2.34 2.30
13 4.67 3.81 3.41 3.18 3.03 2.92 2.83 2.77 2.71 2.67 2.60 2.53 2.46 2.42 2.38 2.34 2.30 2.25 2.21
14 4.60 3.74 3.34 3.11 2.96 2.85 2.76 2.70 2.65 2.60 2.53 2.46 2.39 2.35 2.31 2.27 2.22 2.18 2.13
15 4.54 3.68 3.29 3.06 2.90 2.79 2.71 2.64 2.59 2.54 2.48 2.40 2.33 2.29 2.25 2.20 2.16 2.11 2.07
16 4.49 3.63 3.24 3.01 2.85 2.74 2.66 2.59 2.54 2.49 2.42 2.35 2.28 2.24 2.19 2.15 2.11 2.06 2.01
17 4.45 3.59 3.20 2.96 2.81 2.70 2.61 2.55 2.49 2.45 2.38 2.31 2.23 2.19 2.15 2.10 2.06 2.01 1.96
18 4.41 3.55 3.16 2.93 2.77 2.66 2.58 2.51 2.46 2.41 2.34 2.27 2.19 2.15 2.11 2.06 2.02 1.97 1.92
19 4.38 3.52 3.13 2.90 2.74 2.63 2.54 2.48 2.42 2.38 2.31 2.23 2.16 2.11 2.07 2.03 1.98 1.93 1.88
20 4.35 3.49 3.10 2.87 2.71 2.60 2.51 2.45 2.39 2.35 2.28 2.20 2.12 2.08 2.04 1.99 1.95 1.90 1.84
21 4.32 3.47 3.07 2.84 2.68 2.57 2.49 2.42 2.37 2.32 2.25 2.18 2.10 2.05 2.01 1.96 1.92 1.87 1.81
22 4.30 3.44 3.05 2.82 2.66 2.55 2.46 2.40 2.34 2.30 2.23 2.15 2.07 2.03 1.98 1.94 1.89 1.84 1.78
23 4.28 3.42 3.03 2.80 2.64 2.53 2.44 2.37 2.32 2.27 2.20 2.13 2.05 2.01 1.96 1.91 1.86 1.81 1.76
24 4.26 3.40 3.01 2.78 2.62 2.51 2.42 2.36 2.30 2.25 2.18 2.11 2.03 1.98 1.94 1.89 1.84 1.79 1.73
25 4.24 3.39 2.99 2.76 2.60 2.49 2.40 2.34 2.28 2.24 2.16 2.09 2.01 1.96 1.92 1.87 1.82 1.77 1.71
30 4.17 3.32 2.92 2.69 2.53 2.42 2.33 2.27 2.21 2.16 2.09 2.01 1.93 1.89 1.84 1.79 1.74 1.68 1.62
40 4.08 3.23 2.84 2.61 2.45 2.34 2.25 2.18 2.12 2.08 2.00 1.92 1.84 1.79 1.74 1.69 1.64 1.58 1.51
60 4.00 3.15 2.76 2.53 2.37 2.25 2.17 2.10 2.04 1.99 1.92 1.84 1.75 1.70 1.65 1.59 1.53 1.47 1.39
120 3.92 3.07 2.68 2.45 2.29 2.18 2.09 2.02 1.96 1.91 1.83 1.75 1.66 1.61 1.55 1.50 1.43 1.35 1.25
∞ 3.84 3.00 2.60 2.37 2.21 2.10 2.01 1.94 1.88 1.83 1.75 1.67 1.57 1.52 1.46 1.39 1.32 1.22 1.00
Appendix 5_2
Values of F for F Distributions with .01 of the Area in the Right Tail
EXAMPLE: For a test of a significance of .01 where we have 7 degrees of freedom for the numerator and 5 degrees of freedom
for the denominator, the appropriate F value is found by looking under the 7 degrees of freedom column and proceeding down to
the 5 degrees of freedom row; there we find the appropriate F value to be 10.5
1 4052 5000 5403 5625 5764 5859 5928 5982 6023 6056 6106 6157 6209 6235 6261 6287 6313 6339
2 98.5 99 99.2 99.2 99.3 99.3 99.4 99.4 99.4 99.4 99.4 99.4 99.4 99.5 99.5 99.5 99.5 99.5
3 34.1 30.8 29.5 28.7 28.2 27.9 27.7 27.5 27.3 27.2 27.1 26.9 26.7 26.6 26.5 26.4 26.3 26.2
4 21.2 18.0 16.7 16.0 15.5 15.2 15.0 14.8 14.7 14.5 14.4 14.2 14.0 13.9 13.8 13.7 13.7 13.6
5 16.3 13.3 12.1 11.4 11.0 10.7 10.5 10.3 10.2 10.1 9.89 9.72 9.55 9.47 9.38 9.29 9.20 9.11
6 13.7 10.9 9.78 9.15 8.75 8.47 8.26 8.10 7.98 7.87 7.72 7.56 7.40 7.31 7.23 7.14 7.06 6.97
7 12.2 9.55 8.45 7.85 7.46 7.19 6.99 6.84 6.72 6.62 6.47 6.31 6.16 6.07 5.99 5.91 5.82 5.74
8 11.3 8.65 7.59 7.01 6.63 6.37 6.18 6.03 5.91 5.81 5.67 5.52 5.36 5.28 5.20 5.12 5.03 4.95
9 10.6 8.02 6.99 6.42 6.06 5.80 5.61 5.47 5.35 5.26 5.11 4.96 4.81 4.73 4.65 4.57 4.48 4.40
10 10.0 7.56 6.55 5.99 5.64 5.39 5.20 5.06 4.94 4.85 4.71 4.56 4.41 4.33 4.25 4.17 4.08 4.00
11 9.65 7.21 6.22 5.67 5.32 5.07 4.89 4.74 4.63 4.54 4.40 4.25 4.10 4.02 3.94 3.86 3.78 3.69
12 9.33 6.93 5.95 5.41 5.06 4.82 4.64 4.50 4.39 4.30 4.16 4.01 3.86 3.78 3.70 3.62 3.54 3.45
13 9.07 6.70 5.74 5.21 4.86 4.62 4.44 4.30 4.19 4.10 3.96 3.82 3.66 3.59 3.51 3.43 3.34 3.25
14 8.86 6.51 5.56 5.04 4.70 4.46 4.28 4.14 4.03 3.94 3.80 3.66 3.51 3.43 3.35 3.27 3.18 3.09
15 8.68 6.36 5.42 4.89 4.56 4.32 4.14 4.00 3.89 3.80 3.67 3.52 3.37 3.29 3.21 3.13 3.05 2.96
16 8.53 6.23 5.29 4.77 4.44 4.20 4.03 3.89 3.78 3.69 3.55 3.41 3.26 3.18 3.10 3.02 2.93 2.84
17 8.40 6.11 5.19 4.67 4.34 4.10 3.93 3.79 3.68 3.59 3.46 3.31 3.16 3.08 3.00 2.92 2.83 2.75
18 8.29 6.01 5.09 4.58 4.25 4.01 3.84 3.71 3.60 3.51 3.37 3.23 3.08 3.00 2.92 2.84 2.75 2.66
19 8.19 5.93 5.01 4.50 4.17 3.94 3.77 3.63 3.52 3.43 3.30 3.15 3.00 2.92 2.84 2.76 2.67 2.58
20 8.10 5.85 4.94 4.43 4.10 3.87 3.70 3.56 3.46 3.37 3.23 3.09 2.94 2.86 2.78 2.69 2.61 2.52
21 8.02 5.78 4.87 4.37 4.04 3.81 3.64 3.51 3.40 3.31 3.17 3.03 2.88 2.80 2.72 2.64 2.55 2.46
22 7.95 5.72 4.82 4.31 3.99 3.76 3.59 3.45 3.35 3.26 3.12 2.98 2.83 2.75 2.67 2.58 2.50 2.40
23 7.88 5.66 4.76 4.26 3.94 3.71 3.54 3.41 3.30 3.21 3.07 2.93 2.78 2.70 2.62 2.54 2.45 2.35
24 7.82 5.61 4.72 4.22 3.90 3.67 3.50 3.36 3.26 3.17 3.03 2.89 2.74 2.66 2.58 2.49 2.40 2.31
25 7.77 5.57 4.68 4.18 3.86 3.63 3.46 3.32 3.22 3.13 2.99 2.85 2.70 2.62 2.53 2.45 2.36 2.27
30 7.56 5.39 4.51 4.02 3.70 3.47 3.30 3.17 3.07 2.98 2.84 2.70 2.55 2.47 2.39 2.30 2.21 2.11
40 7.31 5.18 4.31 3.83 3.51 3.29 3.12 2.99 2.89 2.80 2.66 2.52 2.37 2.29 2.20 2.11 2.02 1.92
60 7.08 4.98 4.13 3.65 3.34 3.12 2.95 2.82 2.72 2.63 2.50 2.35 2.20 2.12 2.03 1.94 1.84 1.73
120 6.85 4.79 3.95 3.48 3.17 2.96 2.79 2.66 2.56 2.47 2.34 2.19 2.03 1.95 1.86 1.76 1.66 1.53
∞ 6.63 4.61 3.78 3.32 3.02 2.80 2.64 2.51 2.41 2.32 2.18 2.04 1.88 1.79 1.70 1.59 1.47 1.32