
UNIVERSITAS PELITA HARAPAN

BUSINESS SCHOOL

9th Edition
MANAGEMENT LAB MODULE

ADVANCED
BUSINESS STATISTICS

Head of Programme
Dr. Valentino Budhidarma, S.Kom., M.M.

Subject Coordinator
Dr. Vina Christina Nugroho, S.E., M.M.

Head of Assistant Lab Coordinator


Sylvia Samuel, M.IBL

Prepared By
Lab Assistant Coordinator
Darren Kimi
Lab Assistant Team
Christoforus Axel Billikusuma
Ferren Aurelia
Patricia Angelica
TABLE OF CONTENTS

FOREWORD
INTRODUCTION
INSTRUCTION PROGRAM OUTLINE (SAP)
MODULE 1 : ESTIMATION AND CONFIDENCE INTERVAL
REVIEW QUESTIONS
MODULE 2 : ESTIMATION AND CONFIDENCE INTERVAL
REVIEW QUESTIONS
MODULE 3 : ONE-SAMPLE TEST HYPOTHESIS
REVIEW QUESTIONS
MODULE 4 : HYPOTHESIS TESTING
REVIEW QUESTIONS
MODULE 5 : ANALYZING THE DIFFERENCE IN TWO POPULATION
REVIEW QUESTIONS
MODULE 6 : TWO SAMPLE TEST OF HYPOTHESIS
REVIEW QUESTIONS
MODULE 7 : ANALYSIS OF VARIANCE
REVIEW QUESTIONS
MODULE 8 : CORRELATION AND LINEAR REGRESSION
REVIEW QUESTIONS
MODULE 9 : ESTIMATING Y VALUE
REVIEW QUESTIONS
MODULE 10 : MULTIPLE REGRESSION ANALYSIS
REVIEW QUESTIONS
MODULE 11 : NONPARAMETRIC : GOODNESS OF FIT TEST
REVIEW QUESTIONS
MODULE 12 : NONPARAMETRIC : ANALYSIS OF ORDINAL DATA
REVIEW QUESTIONS
APPENDIX

FOREWORD

Introduction to Business Statistics is the foundation for many statistics classes. In this subject,
students will learn how to apply statistical equations and diagrams appropriately in a business
context. Students will embark on a statistical journey of understanding why and how the world
works, with examples drawn from past and current situations.

This class covers the basics of Business Statistics, namely the elements we need in order to
estimate through hypotheses. We will learn how to find the mean, median, mode, standard
deviation, variance, and many other basics that are essential in creating our own hypotheses.

This module has been compiled in a way that these purposes might be achieved. It contains the key
elements of each chapter, followed by specifically chosen exercises to be solved and discussed in
each meeting with the lab assistant.

The aim of this module is to help students understand more about Intermediate to Business Statistics
so that they can move on to a more advanced statistics education. We will do our best to help
students apply the knowledge from Intermediate to Business Statistics to the cases that will be
given and to more advanced statistical subjects.

I would like to express special thanks to Universitas Pelita Harapan; to Dra. Gracia Shinta S.
Ugut, M.B.A., Ph.D., as The Head of Management Department; and to Mrs. Sylvia Samuel, M.IBL, as
The Lab Assistant Coordinator, for giving us the opportunity to create this module so that students
can learn and understand more about this course subject.

I also want to thank Christoforus Axel Billikusuma, Ferren Aurelia, and Patricia Angelica, who
made many contributions to this module. This module would not exist without them.
To the students: I hope you will learn well together with your laboratory assistants. Do not
hesitate to ask questions and give feedback, because the laboratory assistants are still university
students who are also learning from the experience of teaching a class. It is our wish that you can
use these statistical techniques as tools in the future. May God bless all of you and grant you
wisdom throughout the course. On behalf of the Laboratory Assistant Team, I wish you all great
success and a great journey. Good luck!

Lab Assistant Coordinator,

Darren Kimi (Business Management, 2021).


INTRODUCTION

A. Description
The laboratory subject is tied to the main (theory) subject and cannot be separated from it. The
purpose of laboratory subjects is to help students understand the concepts of the subject by
working through problems and cases. All laboratory subjects carry 0 credits, but each class
lasts 120 minutes, which is equivalent to 2 credits.

B. General Purpose Instruction


After taking part in this class and working through all of the materials, students are expected
to be able to identify, explain, calculate, and analyze the concepts of:

1. Visualizing data on the graphs


2. Describing data through statistics approaches
3. Basic Regression and Index Number
4. Counting probabilities
5. Discrete Distribution
6. Continuous Distribution
7. Sampling Distribution

C. Lecture Activities
Students are expected to participate actively in the class learning process.

1. To facilitate the learning process, the students must read the chapter on the reference
book that is related to the class material. Students are also able to read the brief theory
that is provided in each module.
2. The questions that are provided in this module are only the materials that partially
have been taught in the theory subject.
3. Students must do the questions on the module individually based on the instruction of
the laboratory assistant, do quizzes that will be held, follow the laboratory mid-exam
and final exam based on the given schedule.
D. Class Rules

1. Attendance

Attend at least 85% of the sessions, which is equivalent to attending 11 out of 12 sessions.

2. Lateness

Arriving more than 10 minutes late is regarded as absent.

3. Permission Exception

1. Formal permission from university or faculty


2. Hospitalized (maximum 2 weeks)
3. Sudden passing of core family member (with supporting documents).

E. Grading Composition
The final grade is the sum of the student’s theory and lab scores, with a composition of
80% theory class and 20% lab course.

Below are the components of the lab course grading:

Mid Test: 35%

Final Test: 35%

Quiz 1: 10%

Quiz 2: 10%

Assignment: 10%

*Note: Quiz 1, Quiz 2 and Assignment will be accounted as KAT.


F. Grading Scale

Score Grade
90 – 100 A
85 – 89.99 A-
80 – 84.99 B+
75 – 79.99 B
70 – 74.99 B-
65 – 69.99 C+
60 – 64.99 C
55 – 59.99 C-
0 – 54.99 F
INSTRUCTION PROGRAM OUTLINE
(SAP)
WEEK | MODULE | CHAPTER | MATERIAL
1 | Module 1, 2 | Chapter 9 | Point estimate, Confidence interval population mean (standard deviation known/not known), Population correction factors; Confidence interval population proportion, Sample Size, Finite correction factor
2 | Module 3 | Chapter 10, 15 | Hypothesis testing: population mean; One and two ways hypothesis testing; Type I error and Type II error; z and t distributions
3 | Module 4 | Chapter 10, 15 | Hypothesis testing: population proportion; One and two ways hypothesis testing; p value
4 | Module 5 | Chapter 11, 15 | Two sample test; population mean; independent and dependent samples (variances known/not known)
5 | Module 5 | Chapter 11, 15 | Two sample test; population mean; independent and dependent samples (variances known/not known)
6 | Module 6 | Chapter 11, 15 | Two sample test; population proportion; population variances
7 | MID-TERM EXAM
8 | Module 7 | Chapter 12 | F Distribution; compare two population variances, assumptions in ANOVA, ANOVA testing
9 | Module 8 | Chapter 13 | Correlation analysis define, correlation coefficient, determination coefficient, significant test, assumptions on regression analysis
10 | Module 9 | Chapter 13 | Estimation for single value Y
11 | Module 10 | Chapter 14 | Multiple regression analysis; model fit; inference on regression analysis
12 | Module 11 | Chapter 15 | Goodness of fit test; contingency table analysis
13 | Module 12 | Chapter 16 | Sign test, Wilcoxon Signed-Rank, Kruskal-Wallis, rank-order correlation
14 | FINAL EXAM
MODULE 1
ESTIMATION AND CONFIDENCE INTERVAL (1)
By Darren Kimi

Point estimate for a population mean


Point estimate is a single statistic used to estimate a population parameter.

The following examples illustrate point estimates of population means:


Recent medical studies indicate that exercise is an important part of a person’s overall health.
The director of human resources at OCF, a large glass manufacturer, wants an estimate of the
number of hours per week employees spend exercising. A sample of 70 employees reveals
the mean number of hours of exercise last week is 3.3. This value is a point estimate of the
unknown population mean.
The sample mean (𝑥̅) is not the only point estimate of a population parameter. For example, a
sample proportion (p) is a point estimate of the population proportion (𝜋), and the sample
standard deviation (s) is a point estimate of the population standard deviation (𝜎).

Confidence intervals for a population mean

To compute a confidence interval for a population mean, we will consider two situations:
● We use sample data to estimate 𝜇 with 𝑥̅, and the population standard deviation (𝜎) is
known.
● We use sample data to estimate 𝜇 with 𝑥̅, and the population standard deviation (𝜎) is
unknown. In this case, we substitute the sample standard deviation (s) for the population
standard deviation (𝜎).
Population standard deviation (𝝈) known
A confidence interval is computed using two statistics: the sample mean (𝑥̅) and the standard
deviation (𝜎). In computing a confidence interval, the standard deviation is used to compute
the limits of the confidence interval.
A confidence interval for the population mean when the population follows the normal
distribution and the population standard deviation is known is computed by:

$$\bar{x} \pm z\frac{\sigma}{\sqrt{n}}$$

Example:
Del Monte Foods distributes diced peaches in 4.51-ounce plastic cups. To ensure that
each cup contains at least the required amount, Del Monte sets the filling operation to
dispense 4.51 ounces of peaches and gel in each cup. From historical data, Del Monte knows
that the filling process follows the normal probability distribution with a standard deviation
of 0.04 ounce. The quality control technician selects a sample of 64 cups at the start
of each shift. This morning the sample of 64 cups had a sample mean of 4.507 ounces. Develop a
95% confidence interval for the population mean.
● Step 1
𝑥̅ = 4.507 ounce
𝜎 = 0.04 ounce
n = 64 cups
confidence interval = 95%
● Step 2
Use the confidence level to find the z value:
1. First, divide the confidence level in half: 0.95/2 = 0.4750.
2. Find the value 0.4750 in the body of the z table.
3. Locate the corresponding row value in the left margin, which is 1.9, and the column
value in the top margin, which is 0.06. Adding the row and column values gives a z value
of 1.96.
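● Step 3

$$\bar{x} \pm z\frac{\sigma}{\sqrt{n}} \;\rightarrow\; 4.507 - 1.96\left(\frac{0.04}{\sqrt{64}}\right) \text{ and } 4.507 + 1.96\left(\frac{0.04}{\sqrt{64}}\right)$$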


The 95% confidence interval estimates that the population mean is between 4.4972 ounces
and 4.5168 ounces of peaches and gel.
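
The three steps above can be sketched in Python as a quick check. This is a minimal illustration, not part of the original module: the function name `z_confidence_interval` is hypothetical, and the z value is taken from the standard library's normal distribution rather than read from the printed z table.

```python
from math import sqrt
from statistics import NormalDist


def z_confidence_interval(mean, sigma, n, confidence=0.95):
    """Confidence interval for a population mean when sigma is known."""
    # Two-sided critical value (about 1.96 for 95% confidence).
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    margin = z * sigma / sqrt(n)
    return mean - margin, mean + margin


# Del Monte example: x-bar = 4.507, sigma = 0.04, n = 64
lower, upper = z_confidence_interval(mean=4.507, sigma=0.04, n=64)
print(f"{lower:.4f} to {upper:.4f}")  # 4.4972 to 4.5168
```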

Population standard deviation(𝝈) unknown


In this sampling situation the population standard deviation (𝜎) is unknown. We use the
sample standard deviation (s) and replace the z distribution with the t distribution.

To develop a confidence interval for the population mean using the t distribution, we use:

$$\bar{x} \pm t\frac{s}{\sqrt{n}}$$

Example:
A tire manufacturer wishes to investigate the tread life of its tires. A sample of 10 tires driven
50,000 miles revealed a sample mean of 0.32 inch of tread remaining with a standard
deviation of 0.09 inch. Construct a 95% confidence interval for the population mean. Would
it be reasonable for the manufacturer to conclude that after 50,000 miles the population mean
amount of tread remaining is 0.3 inch?
● Step 1

𝑥̅ = 0.32 inch
𝑠 = 0.09 inch
n = 10 tires
Confidence interval = 95%
● Step 2
Df = n – 1
= 10 – 1
Df = 9
Confidence interval = 95%
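From the t distribution table, the value of t for df = 9 and a 95% level of confidence is 2.262.
● Step 3
$$\bar{x} \pm t\frac{s}{\sqrt{n}} \;\rightarrow\; 0.32 - 2.262\left(\frac{0.09}{\sqrt{10}}\right) \text{ and } 0.32 + 2.262\left(\frac{0.09}{\sqrt{10}}\right)$$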
The manufacturer can be reasonably sure (95% confident) that the mean remaining tread
depth is between 0.256 and 0.384 inch. Because the value of 0.30 is in this interval, it is
reasonable to conclude that the population mean could be 0.30 inch.
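
The same calculation can be sketched in Python. This is an illustrative helper only (the name `t_confidence_interval` is hypothetical); because the standard library has no t distribution, the critical value 2.262 is passed in as read from the t table for df = 9.

```python
from math import sqrt


def t_confidence_interval(mean, s, n, t_critical):
    """Confidence interval for a population mean when sigma is unknown.

    t_critical is looked up in a t table for df = n - 1.
    """
    margin = t_critical * s / sqrt(n)
    return mean - margin, mean + margin


# Tire example: x-bar = 0.32, s = 0.09, n = 10, t(df=9, 95%) = 2.262
lower, upper = t_confidence_interval(0.32, 0.09, 10, t_critical=2.262)
print(f"{lower:.3f} to {upper:.3f}")  # 0.256 to 0.384
print(lower <= 0.30 <= upper)         # True: 0.30 lies inside the interval
```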
REVIEW QUESTIONS

Problem 1.1
Nijisanji Limited has calculated its employee activity score for the last 15 days. The
information below shows the result for the activity:

6 , 18 , 6 , 5 , 14, 13 , 23 , 11 , 21 , 22 , 7 , 9 , 20, 12 , 21 , 16, 20 , 19 ,17 , 22 , 15 , 20 , 22

a. Determine the mean and the standard deviation of the sample


b. Develop a 95% confidence interval for the population mean. Interpret the result.
c. Explain why the t – distribution is used as a part of the confidence interval.

Problem 1.2
A report from a nearby neighborhood regarding tax payment was issued. A random sample of
65 of these reports showed the mean amount of tax was $45.000 with a sample standard
deviation of $8.000. What is a 98% confidence interval for the mean amount of the tax
payment?

Problem 1.3
A quarterly financial statement discusses issues of Ligma Effect with economic conditions in
nearby companies. In a sample of 30 statements, the mean was $23,456 and the sample standard
deviation was $3,456.

a. Based on the information above, develop a 90% confidence interval for the population
mean.

b. Would it be reasonable to conclude that the population mean is $22,380?


MODULE 2
ESTIMATION AND CONFIDENCE INTERVAL (2)
By Darren Kimi

Confidence Interval for a Proportion


Proportion is the fraction, ratio, or percent indicating the part of the sample or the population
having a particular trait of interest. As an example of a proportion, a recent survey indicated
that 92 out of 100 surveyed favored the continued use of daylight savings time in the summer.
The sample proportion is 92/100, or .92, or 92 percent. If we let p represent the sample
proportion, X the number of “successes,” and n the number of items sampled, we can
determine a sample proportion as follows.

Sample Proportion

$$p = \frac{X}{n}$$

And to develop a confidence interval for a population proportion, we use:

Confidence Interval for Population Proportion

$$p \pm z\sqrt{\frac{p(1-p)}{n}}$$

Example :
The owner of Shell wishes to determine the proportion of customers who use a credit card or
debit card to pay at the pump. She surveys 40 customers and finds that 20 paid at the pump.
(Use a 95 percent confidence level.)
a) Estimate the value of the population proportion.
b) Develop a 95 percent confidence interval for the population proportion.
Solution :
a) $p = \frac{X}{n} = \frac{20}{40} = 0.5$
b) $p \pm z\sqrt{\frac{p(1-p)}{n}} = 0.5 \pm 1.96\sqrt{\frac{0.5(1-0.5)}{40}} = 0.5 \pm 0.155$,
which gives 0.345 and 0.655
So, the 95% confidence interval estimates that the value of the population proportion is
between 0.345 and 0.655.
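
As a cross-check, the proportion interval can be sketched in Python. The function name `proportion_confidence_interval` is hypothetical, and the z value comes from the standard library's normal distribution instead of the printed table.

```python
from math import sqrt
from statistics import NormalDist


def proportion_confidence_interval(successes, n, confidence=0.95):
    """Return the sample proportion and its confidence limits."""
    p = successes / n
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    margin = z * sqrt(p * (1 - p) / n)
    return p, p - margin, p + margin


# Shell example: 20 of 40 customers paid at the pump
p, lower, upper = proportion_confidence_interval(successes=20, n=40)
print(f"p = {p:.2f}, interval = {lower:.3f} to {upper:.3f}")
# p = 0.50, interval = 0.345 to 0.655
```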

Choosing an Appropriate Sample Size

a) Sample Size to Estimate a Population Mean

To estimate a population mean, we can express the interaction among these three factors and
the sample size in the following formula. Notice that this formula is the margin of error used to
calculate the endpoints of confidence intervals to estimate a population mean!

$$E = z\frac{\sigma}{\sqrt{n}}$$

Solving this equation for n yields the following result.

Sample Size to Estimate a Population Mean

$$n = \left(\frac{z\sigma}{E}\right)^2$$

where:
n is the size of the sample.
z is the standard normal value corresponding to the desired level of confidence.
σ is the population standard deviation.
E is the maximum allowable error.

Example :
A population is estimated to have a standard deviation of 10. We want to estimate the
population mean within 2, with a 99% level of confidence. How large a sample is
required?
Solution:
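
The calculation for this example can be sketched as follows. This is an illustration under a stated assumption: z = 2.58 is the rounded table value for a 99% level of confidence (a more precise z of 2.576 would give 166 instead). The function name `sample_size_for_mean` is hypothetical.

```python
from math import ceil


def sample_size_for_mean(z, sigma, error):
    """n = (z * sigma / E)^2, rounded up to the next whole observation."""
    return ceil((z * sigma / error) ** 2)


# sigma = 10, E = 2, z = 2.58 (assumed rounded table value for 99%)
n = sample_size_for_mean(z=2.58, sigma=10, error=2)
print(n)  # 167, since (2.58 * 10 / 2)^2 = 166.41
```

The result is always rounded up, because rounding down would give a margin of error slightly larger than the one requested.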
b) Sample Size to Estimate a Population Proportion

To determine the sample size for a proportion, the same three variables need to be specified:
1. The margin of error.
2. The desired level of confidence.
3. The variation or dispersion of the population being studied.

For the binomial distribution, the margin of error is:

$$E = z\sqrt{\frac{\pi(1-\pi)}{n}}$$

Solving this equation for n yields the following equation.

Sample Size to Estimate a Population Proportion

$$n = \pi(1-\pi)\left(\frac{z}{E}\right)^2$$

where:
n is the size of the sample.
z is the standard normal value corresponding to the desired level of confidence.
E is the maximum allowable error.
𝜋 is the population proportion.

Note: If 𝜋 is unknown, use 𝜋 = 0.5.
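
A minimal Python sketch of this formula follows. The function name `sample_size_for_proportion` and the input values (z = 1.96, E = 0.05) are hypothetical and chosen only to illustrate the calculation; they are not taken from the module's examples.

```python
from math import ceil


def sample_size_for_proportion(z, error, pi=0.5):
    """n = pi * (1 - pi) * (z / E)^2, rounded up.

    pi defaults to 0.5, the value to use when no estimate is available.
    """
    return ceil(pi * (1 - pi) * (z / error) ** 2)


# Hypothetical inputs: 95% confidence (z = 1.96), margin of error 0.05
print(sample_size_for_proportion(z=1.96, error=0.05))          # 385
print(sample_size_for_proportion(z=1.96, error=0.05, pi=0.4))  # 369
```

Using 𝜋 = 0.5 gives the largest possible value of 𝜋(1 − 𝜋), so it yields the most conservative (largest) sample size.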

Example :
Suppose the President of the United States wants an estimate of the proportion of the
population who support his current policy toward revision in the international market system.
The President wants the estimate to be within 0.6 of the true proportion. Assume a 95% level of
confidence. The President's political advisor estimated the proportion supporting the current
policy to be 0.4. How large is the sample required?
Solution :
Finite-Population Correction Factor

We apply the finite-population correction factor when the sample size is equal to or greater
than 5% of the population (n/N ≥ 0.05).

If we wished to develop a confidence interval for the mean from a finite population and the
population standard deviation was unknown, we would adjust the formula as follows:

$$\bar{x} \pm t\frac{s}{\sqrt{n}}\sqrt{\frac{N-n}{N-1}}$$

Example:
There are 173 families in Seoul, Korea. A random sample of 35 of these families revealed the
mean annual sanitary contribution was $399 and the standard deviation was $69. What is the
best estimate of population mean? (90% confidence interval)

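Solution :
The sample is 35/173 ≈ 20% of the population, so the finite-population correction factor
applies. With df = 35 − 1 = 34, the t value for a 90% level of confidence is 1.691.

$$399 \pm 1.691\left(\frac{69}{\sqrt{35}}\right)\sqrt{\frac{173-35}{173-1}}$$

= 381.3342 and 416.6658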
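
The interval above can be reproduced with a short Python sketch. It assumes the t value 1.691 (df = 34, 90% level of confidence) is read from a t table; the function name `finite_population_interval` is hypothetical.

```python
from math import sqrt


def finite_population_interval(mean, s, n, N, t_critical):
    """Confidence interval for a mean with the finite-population correction."""
    fpc = sqrt((N - n) / (N - 1))  # finite-population correction factor
    margin = t_critical * s / sqrt(n) * fpc
    return mean - margin, mean + margin


# Seoul example: N = 173 families, n = 35, x-bar = 399, s = 69
print(35 / 173 >= 0.05)  # True: the sample is at least 5% of the population
lower, upper = finite_population_interval(399, 69, 35, 173, t_critical=1.691)
print(f"{lower:.2f} and {upper:.2f}")  # 381.33 and 416.67
```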


REVIEW QUESTIONS

Problem 2.1
Suppose a market research company is hired to estimate the percentage of adults who live in
big cities that have PCs. 400 adult residents randomly selected in this city were surveyed to
determine if they had a PC. Of the 400 people surveyed, 265 said yes, they have a PC. Using
a 95% confidence level, calculate the estimated confidence interval for the actual proportion
of adult residents in this city who have PCs.

Problem 2.2
An analyst in a flight company wants to determine the mean amount that pilots in small cities
earn per month. The error in estimating the mean is to be less than $155,000 with a 90
percent level of confidence. The analyst found a report that estimated the standard deviation
to be $190,000. What is the required sample size?

Problem 2.3
A retailer would like to estimate the proportion of their customers who bought an item after
viewing their online website. The retailer wants the margin of error to be within 0.65 of the
population proportion, the desired level of confidence is 95%, and no estimate is available of
the population proportion. What is the required sample size ?

Problem 2.4
Thirty people from a population of 300 were asked how much they had in savings. The
sample mean was $150,000 with a sample standard deviation of $90,000. Construct a 95%
confidence interval estimate for the population mean.
MODULE 3
ONE-SAMPLE TESTS OF HYPOTHESIS (1)
By Darren Kimi

What is a Hypothesis?
A hypothesis is a statement about a population parameter subject to verification.
Business researchers often develop hypotheses that can be studied and explored to find
answers. Hypotheses are tentative explanations of a principle operating in nature.

Six-Step Procedure for Testing a Hypothesis

1. State the null and alternative hypotheses.


2. Select the level of significance / confidence level.
3. Identify the test statistic (What kind of table will it use later on?)
4. Formulate the decision rule.
5. Take a sample and arrive at your decision.
6. Interpret (put into words) the results that you have achieved.

The Formula

If H1: µ ≠ . . . , it is a two-tailed test, so alpha (α) must be divided by 2.

If H1: µ > . . . or µ < . . . , it is a one-tailed test.


H0 is the null hypothesis
H1 is the alternate hypothesis
Population Mean (σ known)

$$z = \frac{\bar{x} - \mu}{\sigma/\sqrt{n}}$$

Example :
A survey conducted across Asia reported that the average net income for the electronic
industry is $85,621. A random sample of 132 is taken, and the sample mean is $95,874.
Assume the population standard deviation of net income is $15,250 and α = 5%.

Step 1: null and alternate hypotheses


H0 : µ = $85,621
H1 : µ ≠ $85,621
This is a two-tailed test.

Step 2: Select a level of significance


α = 5%, because it is two tailed test, so the α is divided by 2. α/2 = 0.025

Step 3: Select the test statistic


The test statistic is z when the population standard deviation is known.

Step 4: Formulate the decision Rule


The test is two-tailed and α = 5%, so α/2 = 0.025 (see H0 and H1). There is therefore a 0.4750
area between the mean and each of the critical values that separate the tails of the
distribution: 0.5 − 0.025 = 0.475. Using this 0.4750 area and the z table, the critical value
can be obtained.
Zα/2 = ±1.96 → z table

If the Z value is not between −1.96 and +1.96, reject the null hypothesis (H0). If the Z value is
between −1.96 and +1.96, do not reject the null hypothesis (H0).
Step 5: Make a decision

x̅ = $95,874 , n = 132
σ = $15,250 , µ = $85,621

$$z = \frac{\bar{x}-\mu}{\sigma/\sqrt{n}} = \frac{95{,}874 - 85{,}621}{15{,}250/\sqrt{132}} = 7.7245$$

We get a Z value of 7.7245. Since 7.7245 does not fall between −1.96 and +1.96, we conclude
that the null hypothesis is false (reject the null hypothesis).
Step 6: Interpret the result

H0 is rejected, so the average net income for the electronic industry is not equal to $85,621. It
can be higher or lower.
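
The six steps for this example can be condensed into a short Python sketch. The function name `z_test_mean` is hypothetical, and the critical value is computed from the standard library's normal distribution rather than read from the z table.

```python
from math import sqrt
from statistics import NormalDist


def z_test_mean(sample_mean, mu0, sigma, n, alpha=0.05):
    """Two-tailed z test for a population mean with sigma known."""
    z = (sample_mean - mu0) / (sigma / sqrt(n))
    z_critical = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    reject_h0 = abs(z) > z_critical
    return z, z_critical, reject_h0


# Net income example: x-bar = 95,874, mu0 = 85,621, sigma = 15,250, n = 132
z, z_critical, reject_h0 = z_test_mean(95_874, 85_621, 15_250, 132)
print(f"z = {z:.2f}, critical = ±{z_critical:.2f}, reject H0: {reject_h0}")
# z = 7.72, critical = ±1.96, reject H0: True
```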

Testing Mean with Finite Population

Example:
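Using the previous case, suppose in addition that there are 500 electronic industries in Asia
and the sample covers only 132 of them. Because the sample is more than 5% of the population,
we apply the finite-population correction factor and use this formula:

$$z = \frac{\bar{x}-\mu}{\dfrac{\sigma}{\sqrt{n}}\sqrt{\dfrac{N-n}{N-1}}}$$

N = 500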

Testing Hypotheses About Population Mean using T statistic (σ unknown)

The Formula

$$t = \frac{\bar{X} - \mu}{s/\sqrt{n}}$$

df = n – 1
df = degree of freedom

How to calculate s

$$s = \sqrt{\frac{\sum (X - \bar{X})^2}{n-1}}$$

𝑋̅ is the sample mean.
µ is the hypothesized population mean.
s is the sample standard deviation.
n is the number of observations in the sample.

Example :

The weight of sampled Air Conditioner


23.1 21.2 30.2 26.1 22.4
26.5 29.1 27.6 28.6 20.0
24.7 20.8 23.4 23.0 27.8
29.6 24.7 22.8 29.3 26.7

𝑋̅ = 25.38, s = 3.1917, n = 20

Known:
α = 5%, n = 20,
df = n-1 = 19
𝐻0 : µ = 25
𝐻1 : µ ≠ 25
Two-tailed; alpha must be split, which yields α/2 = 0.025
t0.025, 19 = ±2.093 → t table

Using the formula:

$$t = \frac{\bar{X}-\mu}{s/\sqrt{n}} = \frac{25.38 - 25}{3.1917/\sqrt{20}} = 0.53$$

Comparing the table value with the computed value, the observed t of 0.53 is between −2.093
and +2.093, so the null hypothesis (𝐻0) is not rejected.

There is not enough evidence to conclude that the population mean weight of the Air Conditioner
differs from 25.
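
The sample statistics and the t value for this example can be checked with a short Python sketch. The variable names are hypothetical; the critical value 2.093 is taken from the t table for df = 19, since the standard library does not provide the t distribution.

```python
from math import sqrt
from statistics import mean, stdev

# Weights of the 20 sampled air conditioners
weights = [
    23.1, 21.2, 30.2, 26.1, 22.4,
    26.5, 29.1, 27.6, 28.6, 20.0,
    24.7, 20.8, 23.4, 23.0, 27.8,
    29.6, 24.7, 22.8, 29.3, 26.7,
]

mu0 = 25            # hypothesized population mean
t_critical = 2.093  # t table value for df = 19, alpha/2 = 0.025

n = len(weights)
x_bar = mean(weights)
s = stdev(weights)  # sample standard deviation (divides by n - 1)
t_value = (x_bar - mu0) / (s / sqrt(n))

print(f"mean = {x_bar:.2f}, s = {s:.4f}, t = {t_value:.2f}")
# mean = 25.38, s = 3.1917, t = 0.53
print("reject H0" if abs(t_value) > t_critical else "do not reject H0")
# do not reject H0
```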
One tailed Test

In the first case, we wanted to know whether there was a difference in the mean number
assembled, but now we want to know whether there has been an increase or decrease. Because
we are investigating different questions, we will set our hypothesis differently. The biggest
difference occurs in the alternate hypothesis. Before, we stated the alternate hypothesis as
“different from”; now we want to state it as “greater than” or “lower than” In symbols:

One tailed Two tailed


𝐻0 : µ ≤ 50 𝐻0 : µ = 50
𝐻1 : µ ˃ 50 or µ < 50 𝐻1 : µ ≠ 50

The critical values for a one-tailed test are different from those for a two-tailed test at the
same significance level. The formula to test the hypothesis is the same as the formula we
previously discussed. The difference lies in how α is allocated.

In the previous example, we split the significance level (α) in half and put half in the lower
tail and half in the upper tail. In a one-tailed test, we put all the rejection region in one tail.

Example :
A sample of 32 observations is selected from a normal population. The sample mean is 14,
and the population standard deviation is 4. Conduct the following test of hypothesis using the
.05 significance level.
𝐻0: mean is less than or equal to 10
𝐻1: mean is greater than 10

Solution :
𝐻0: µ ≤ 10
𝐻1: µ > 10 α = 0.05
Calculate 0.5 − 0.05 = 0.45. Using this 0.45 area and the z table, the critical value can be
obtained. Because the test is one-tailed (upper tail), α is not divided by 2.
Zα = 1.65 → z table
x̅ = 14 , n = 32
σ = 4 , µ = 10

Using this formula:

$$z = \frac{\bar{x}-\mu}{\sigma/\sqrt{n}} = \frac{14 - 10}{4/\sqrt{32}} = 5.66$$

Because the Z value is higher than +1.65, reject the null hypothesis (H0).
It means the population mean is greater than 10.
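
To see how the one-tailed critical value differs from the two-tailed one at the same significance level, the example can be sketched in Python. The variable names are hypothetical, and the critical values are computed from the standard library's normal distribution, so they show 1.645 where a printed table is often rounded to 1.65.

```python
from math import sqrt
from statistics import NormalDist

alpha = 0.05
std_normal = NormalDist()

# All of alpha in one tail versus alpha split across two tails
one_tailed_critical = std_normal.inv_cdf(1 - alpha)
two_tailed_critical = std_normal.inv_cdf(1 - alpha / 2)
print(f"{one_tailed_critical:.3f} vs {two_tailed_critical:.3f}")  # 1.645 vs 1.960

# Example: x-bar = 14, mu0 = 10, sigma = 4, n = 32, H1: mu > 10
z_value = (14 - 10) / (4 / sqrt(32))
print(f"z = {z_value:.2f}")             # z = 5.66
print(z_value > one_tailed_critical)    # True: reject H0
```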
REVIEW QUESTIONS
Problem 3.1
The mean work week for an accountant in LPH is believed to be about 70 hours. A newly
hired accountant hopes that it is shorter. She asks 10 of her accountant friends in other
firms for the lengths of their mean work weeks. Below is the data (lengths of mean work
week).
55 60 55 60 65 66 66 70 50 45
Based on the data above, should she count on the mean work week being shorter than 70
hours? Use the .01 significance level.

Problem 3.2
Better Supply Co. manufactures and assembles wooden tables in several plants in Jakarta.
Plant Z produced 150 wooden tables every week. Plant Z follows a normal probability
distribution with a mean of 300 and a standard deviation of 80. Recently, because of market
expansion, new production methods have been introduced and new employees were hired.
The head manager of the manufacturing department would like to investigate whether there
has been a change in the weekly production of the wooden tables. Is the mean number of
tables produced by Plant Z different from 250 at the .05 significance level?

Problem 3.3
Given the following hypothesis:
H0 : µ ≤ 15
H1 : µ ˃ 15
A random sample of 30 observations is selected from a normal population. The sample
mean was 17 and the sample standard deviation 8. Using the .05 significance level:
a. State the decision rule.
b. Compute the value of the test statistic.
c. What is your decision regarding the null hypothesis?
MODULE 4
HYPOTHESIS TESTING
by Patricia Angelica

Testing Concerning Proportion


In the previous chapter, we discussed confidence intervals for proportions. We can also
conduct a test of hypotheses for a proportion. Recall that a proportion is the ratio of the
number of successes to the number of observations. Thus, the formula for computing a
sample proportion, p, is

$$p = \frac{X}{n}$$

X: number of successes
n: number of observations

To test a hypothesis about a population proportion, a random sample is chosen from the
population. Some assumptions must be made and conditions met before testing a population
proportion:
1. The sample data collected are the result of counts
2. The outcome of an experiment is classified into one of two mutually exclusive
categories: a “success” or a “failure”
3. The probability of a success is the same for each trial
4. The trials are independent, meaning the outcome of one trial does not affect the
outcome of any other trial.
The test we will conduct shortly is appropriate when both n𝜋 and n(1 − 𝜋) are at least 5, where
n is the sample size and 𝜋 is the population proportion. It takes advantage of the fact that a
binomial distribution can be approximated by the normal distribution.
We can determine the formula to calculate proportion hypothesis testing as follows:

$$z = \frac{p - \pi}{\sqrt{\dfrac{\pi(1-\pi)}{n}}}$$

Where:
𝜋: population proportion
p: sample proportion
n: sample size
One-tail proportion hypothesis testing
Suppose prior elections in a certain state indicated it is necessary for a candidate for governor
to receive at least 80 percent of the vote in the northern section of the state to be elected. The
incumbent governor is interested in assessing his chances of returning to office and plans to
conduct a survey of 2,000 registered voters in the northern section of the state and it is
revealed that 1,550 voters planned to vote for incumbent governor.Using the hypothesis-
testing procedure, assess the governor’s chances of reelection.

Step 1: State the hypothesis


The governor is concerned only when the proportion is less than 0.80. If it is equal to or
greater than 0.80, he has no problem. So, the hypotheses are written as follows:

H0: 𝜋 ≥ .80

H1: 𝜋 <.80
Step 2: Select level of significance
One-tail hypothesis testing with a level of significance (𝛼) of 0.05.
Step 3: Select the test statistics
z is the appropriate statistic,
Step 4: Formulate decision rule

The test is one-tailed and the alternate hypothesis points to the left, so only the left side of
the curve is used. The significance level is 0.05, so the area between zero and the critical
value is 0.45 (0.5 − 0.05), which gives a critical value of z = −1.65. The decision rule is
therefore to reject the null hypothesis if the computed value falls to the left of −1.65.
Step 5: Make decision and interpret result

The sample proportion is p = 1,550/2,000 = 0.775, so

$$z = \frac{0.775 - 0.80}{\sqrt{0.80(1-0.80)/2{,}000}} = -2.80$$

The computed value of z (−2.80) is in the rejection region, so the null hypothesis is
rejected at the 0.05 level. The evidence at this point does not support the claim
that the incumbent governor will return to the governor’s mansion for another four years.

Two tail proportion hypothesis testing


The following hypotheses are given.

H0: 𝜋 = .40

H1: 𝜋 ≠.40
A sample of 120 observations revealed that p =.30. At the .05 significance level, can the null
hypothesis be rejected?

Step 1: state the hypothesis

H0: 𝜋 = .40

H1: 𝜋 ≠.40
Step 2: select level of significance
Two-tail hypothesis testing with a significance level of 0.05 (95% confidence level).
Step 3: select the test statistics
z is the appropriate statistic,
Step 4: formulate decision rule
The alternate hypothesis does not state a direction, so this is a two-tailed test and both sides
of the curve are used. The significance level is split between the tails (0.05/2 = 0.025), so the
area between zero and each critical value is 0.475 (0.5 − 0.025), which gives critical values of
z = ±1.96. The decision rule is therefore to reject the null hypothesis if the computed value
falls in either rejection region.
Step 5: make decision and interpret result

$$z = \frac{0.30 - 0.40}{\sqrt{0.40(1-0.40)/120}} = -2.24$$

The computed value of z (−2.24) is in the rejection region, so the null hypothesis is
rejected at the 0.05 level. It indicates that the population proportion is not equal to 0.4.
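
Both proportion examples in this module can be recomputed directly from their sample figures with a short Python sketch. The function name `proportion_z` is hypothetical; the critical values −1.65 and ±1.96 are the table values quoted in the decision rules.

```python
from math import sqrt


def proportion_z(p, pi0, n):
    """Test statistic for a hypothesis about a population proportion."""
    return (p - pi0) / sqrt(pi0 * (1 - pi0) / n)


# One-tailed example: 1,550 of 2,000 voters, H0: pi >= 0.80
z_governor = proportion_z(1550 / 2000, 0.80, 2000)
print(f"{z_governor:.2f}", z_governor < -1.65)    # -2.80 True (reject H0)

# Two-tailed example: p = 0.30, n = 120, H0: pi = 0.40
z_two_tail = proportion_z(0.30, 0.40, 120)
print(f"{z_two_tail:.2f}", abs(z_two_tail) > 1.96)  # -2.24 True (reject H0)
```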

P-value in hypothesis testing


This approach reports the probability (assuming that the null hypothesis is true) of getting a
value of the test statistic at least as extreme as the value actually obtained. This process
compares the probability, called the p-value, with the significance level. If the p-value is
smaller than the significance level, H0 is rejected. If it is larger than the significance level, H0
is not rejected.

Determining the p-value not only results in a decision regarding H0, but it gives us additional
insight into the strength of the decision. A very small p-value, such as .0001, indicates that
there is little likelihood the H0 is true. On the other hand, a p-value of .2033 means that H0 is
not rejected, and there is little likelihood that it is false.


Example: if z = 1.96, the z table gives P(0 < Z < 1.96) = 0.475 (use the row and column labels
on the outside of the table to find the area inside the table). So the one-tailed p-value is
0.025 (0.5 − 0.475). If alpha = 0.05, H0 is rejected because p-value < alpha. But if
alpha = 0.01, H0 is not rejected because p-value > alpha.
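
The p-value lookup in this example can be sketched in Python using the standard library's normal distribution instead of the z table. The function name `p_value` and its `tails` parameter are hypothetical.

```python
from statistics import NormalDist


def p_value(z, tails=1):
    """Area beyond |z| under the standard normal curve, times the number of tails."""
    tail_area = 1 - NormalDist().cdf(abs(z))
    return tails * tail_area


p_one = p_value(1.96)           # one-tailed
p_two = p_value(1.96, tails=2)  # two-tailed
print(f"{p_one:.3f} {p_two:.3f}")  # 0.025 0.050

# Compare the one-tailed p-value with two significance levels
print(p_one < 0.05)  # True: reject H0 at alpha = 0.05
print(p_one < 0.01)  # False: do not reject H0 at alpha = 0.01
```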
REVIEW QUESTIONS

PROBLEM 4.1
Research done at Clarabell Company showed that 35 percent of its workers had worked from
home this year. Clarabell Company had employed 100 workers; only 20 did work from home
last year. Use the five-step hypothesis-testing procedure at the 0.1 significance level to test
whether this data contradicts the research report. What is the p-value, and what does it
imply?

PROBLEM 4.2
A polling done at Sera Institution indicates that 20 percent of workers had their turnover on
the first year of their job. A random sample of 225 workers revealed that 70 had their
turnover after the first year of the program. Has there been a significant decrease in the
proportion of students who change their major after the first year in this program? Use the .02
level of significance.

PROBLEM 4.3
The following hypotheses are given.

H0: 𝜋 ≥ .45

H1: 𝜋 <.45
A sample of 280 observations revealed that p =.80 . At the .05 significance level, can the null
hypothesis be rejected?
a. State the decision rule.
b. Compute the value of the test statistic.
c. What is your decision regarding the null hypothesis?
MODULE 5
ANALYZING THE DIFFERENCES IN TWO POPULATION
by Patricia Angelica

Two-Sample Tests of Hypothesis : Independent Samples


There are three criteria for using the z formula:
● The samples are from independent populations.
● Both populations follow the normal distribution.
● Both populations’ standard deviations are known.

The z formula for the difference in two sample means is:

$$z = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}}$$

DEMONSTRATION PROBLEM I
As part of a study of corporate employees, the director of human resources for PNC Inc.
wants to compare the distance traveled to work by employees at its office in downtown
Cincinnati with the distance for those in downtown Pittsburgh. A sample of 35 Cincinnati
employees showed they travel a mean of 370 miles per month. A sample of 40 Pittsburgh
employees showed they travel a mean of 380 miles per month. The population standard
deviation for the Cincinnati and Pittsburgh employees are 30 and 26 miles, respectively. At
the 0.05 significance level, is there a difference in the mean number of miles traveled per
month between Cincinnati and Pittsburgh employees?

SOLUTION :
Step 1 : State the null hypothesis and the alternate hypothesis

H0: 𝜇1 = 𝜇2

H1: 𝜇1 ≠ 𝜇2

Step 2 : Select the level of significance


α = 0.05 (two tailed)

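Step 3 : Determine the test statistic

Both population standard deviations are known, so we use the z formula, with Cincinnati as sample 1 and Pittsburgh as sample 2:

$$z = \frac{370 - 380}{\sqrt{\frac{30^2}{35} + \frac{26^2}{40}}} = \frac{-10}{6.528} = -1.532$$

Step 4 : Formulate a decision rule

For a two-tailed test at α = 0.05, the area between 0 and the critical value is 0.95 / 2 = 0.4750, which gives z = 1.96. Reject H0 if z < −1.96 or z > 1.96.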

Step 5 : Make the decision regarding H0 and interpret the result

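The computed z is −1.532. Its absolute value, 1.532, is smaller than the critical value 1.96, so our
decision is not to reject the null hypothesis (H0). There is not enough evidence of a difference in the
mean number of miles traveled per month between Cincinnati and Pittsburgh employees.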
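
The calculation in Demonstration Problem I can be checked with a short script. This is a minimal sketch using only the Python standard library; the function name `two_sample_z` is an illustrative choice, not part of the module, and the two-tailed p-value line is an addition that the demonstration itself does not report.

```python
import math
from statistics import NormalDist


def two_sample_z(mean1, mean2, sigma1, sigma2, n1, n2):
    """z statistic for the difference of two means with known population sigmas."""
    standard_error = math.sqrt(sigma1**2 / n1 + sigma2**2 / n2)
    return (mean1 - mean2) / standard_error


# Cincinnati is sample 1, Pittsburgh is sample 2 (values from the problem).
z = two_sample_z(370, 380, 30, 26, 35, 40)

# Two-tailed p-value from the standard normal distribution.
p_value = 2 * (1 - NormalDist().cdf(abs(z)))

# Critical value 1.96 for alpha = 0.05, two-tailed.
reject_h0 = abs(z) > 1.96

print(round(z, 3), reject_h0)
```

Because the p-value is larger than 0.05, it leads to the same decision as comparing z with the critical value.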

Two Sample Tests about Proportions


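If we want to test the difference between two sample proportions, we use the z formula :

$$z = \frac{p_1 - p_2}{\sqrt{\frac{p_c(1 - p_c)}{n_1} + \frac{p_c(1 - p_c)}{n_2}}}$$

where p_c is the pooled proportion of the two samples combined.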
DEMONSTRATION PROBLEM II
A group of researchers attempted to determine whether there was a difference in the
proportion of consumers and the proportion of CEOs who believe fear of getting caught or
losing one’s job is a strong influence on ethical behaviour. In their study, they found that 57%
of consumers said that fear of getting caught or losing one’s job was a strong influence on
ethical behaviour, but only 50% of CEOs felt the same way. Suppose these data were
determined from a sample of 755 consumers and 616 CEOs. Does this result provide enough
evidence to declare that a significantly higher proportion of consumers than CEOs believe
fear of getting caught or losing one’s job is a strong influence on ethical behaviour? (𝛼 = 0.1)

SOLUTION
Step 1: State H0 and H1
The test is one-tailed because we are trying to show that the proportion of
consumers who hold this belief is significantly higher than the proportion of CEOs. Thus
we have :
H0 : π1 ≤ π2
H1 : π1 > π2

Where :
π1 = the proportion of consumers
π2 = the proportion of CEOs

Step 2 : Select the level of significance and find the critical z value

We know that 𝛼 = 0.1 and that this is a one-tailed test (>). The area between 0 and the critical value is

0.5 – 0.1 (significance level) = 0.4000

So, we get a critical z value of 1.28. Reject H0 if the observed z is greater than 1.28.

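Step 3 : Calculate the pooled proportion 𝑝𝑐

$$p_c = \frac{n_1 p_1 + n_2 p_2}{n_1 + n_2} = \frac{755(0.57) + 616(0.50)}{755 + 616} = \frac{738.35}{1371} = 0.5386$$

Step 4 : Calculate the z

$$z = \frac{0.57 - 0.50}{\sqrt{\frac{0.5386(0.4614)}{755} + \frac{0.5386(0.4614)}{616}}} \approx 2.59$$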


Step 5 : Make a decision and interpret the answer
Since the observed z is higher than the critical z value of 1.28, we reject the null hypothesis in
favor of the alternative hypothesis (H1). The sample provides evidence that a significantly
higher proportion of consumers than CEOs believe fear of getting caught or losing one’s
job is a strong influence on ethical behaviour.
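
The pooled proportion and the z value for Demonstration Problem II can be reproduced as follows. This is a sketch under the assumption that the reported percentages (57% and 50%) are used directly as sample proportions; `two_proportion_z` is an illustrative name.

```python
import math


def two_proportion_z(p1, n1, p2, n2):
    """Return the pooled proportion and the pooled two-proportion z statistic."""
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    standard_error = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return pooled, (p1 - p2) / standard_error


# Consumers are sample 1, CEOs are sample 2.
pooled, z = two_proportion_z(0.57, 755, 0.50, 616)

# One-tailed test at alpha = 0.1: reject H0 when z exceeds 1.28.
reject_h0 = z > 1.28

print(round(pooled, 4), round(z, 2), reject_h0)
```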

Independent Samples with Unknown Population Standard Deviation


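If the population standard deviations are unknown, we use the t distribution for the test statistic. The t
formula for the difference in two sample means is :
a) Assuming both population variances are equal :

$$t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{s_p^2\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}}, \qquad s_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}, \qquad df = n_1 + n_2 - 2$$

b) Assuming both population variances are not equal :

$$t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}, \qquad df = \frac{\left(\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}\right)^2}{\frac{(s_1^2/n_1)^2}{n_1 - 1} + \frac{(s_2^2/n_2)^2}{n_2 - 1}}$$

In a one-tailed test (H1 states that the difference in population means is greater than or less than
some value), the rejection region of size α lies in the corresponding tail (right or left). In a
two-tailed test (H1 states that the difference is not equal to some value), the rejection region is
split into α/2 in each tail. (α = 100% − confidence level.) We then look up the critical t value to see
whether the calculated t falls in the rejection region or not.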

DEMONSTRATION PROBLEM III


A recent article in The Wall Street Journal compared the cost of adopting children from
China with that of Russia. For a sample of 16 adoptions from China, the mean cost was
$11,045, with a standard deviation of $83. For a sample of 18 adoptions from Russia, the
mean cost was $12,840, with a standard deviation of $1,545. Can we conclude the mean cost
is larger for adopting children from China? Assume the two population standard deviations
are not the same. Use the .05 significance level.

SOLUTIONS

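Step 1 : State H0 and H1

H0 : μ1 ≤ μ2
H1 : μ1 > μ2
where μ1 is the mean cost for China and μ2 is the mean cost for Russia.
Step 2 : Select the test statistic. The population standard deviations are unknown and assumed to be unequal, so we use the t formula for unequal variances.

Step 3 : The value of alpha is 0.05

Step 4 : Find the degrees of freedom to determine the critical t table value

t.05,27 (one tailed) = 1.703

Reject H0 if the observed t > 1.703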

Step 5 : Find the observed t value

Step 6 : Make a decision


Because the observed t value is lower than the critical t table value, the
decision is not to reject the null hypothesis.
Step 7 : Interpret your answer
We cannot conclude that the mean adoption cost from China is larger than the mean adoption
cost from Russia.

Two Sample Tests of Hypothesis : Dependent Samples


For dependent samples, the null hypothesis assumes that the distribution of the paired differences
between the populations has a mean of 0.

a. We first compute the mean and the standard deviation of the sample differences.
b. The value of the test statistic is computed from the following formula :

$$t = \frac{\bar{d}}{s_d / \sqrt{n}}$$

𝑑𝑓 = 𝑛 − 1

Where :
d̄ = mean of the sample differences
sd = standard deviation of the sample differences
n = number of pairs

DEMONSTRATION PROBLEM IV
The management of Discount Furniture, a chain of discount furniture stores in the North-east,
designed an incentive plan for salespeople. To evaluate this innovative plan, 12 salespeople
were selected at random, and their weekly incomes before and after the plan were recorded.
Was there a significant increase in the typical salesperson’s weekly income due to the
innovative incentive plan? Use the .05 significance level. Interpret your answer.
SOLUTION
Step 1 : State H0 and H1
Let d = weekly income after the plan − weekly income before the plan.
H0 : μd ≤ 0
H1 : μd > 0
Step 2 : Calculate d and sd

Step 3 :Find the t table value


After we calculate the degrees of freedom, we can find the t table value.
The degrees of freedom are 12 − 1 = 11, with α = .05. The t table value is t0.05,11 (one tailed) = 1.796.
The decision rule is to reject the null hypothesis if the observed t value is greater than 1.796.
Step 4 : Calculate the observed t value

Step 5: Make a decision and interpret your answer
The observed value is greater than the t table value, so the decision is to reject the null
hypothesis. The incentive plan resulted in an increase in weekly income.
REVIEW QUESTIONS

PROBLEM 5.1
Of 270 women who tested the newly launched salted egg rice box, 100 like the
taste and the packaging. Meanwhile, of a group of 350 men, 150 like the taste and
the packaging. At the 0.10 significance level, can we conclude that there is a significant
difference between the proportions of women and men who like the taste and the packaging of the
newly launched salted egg rice box? Determine the p-value!

PROBLEM 5.2
Lisa observes the difference in sales between group Papoy and group Pipoy. A 70-day
sample shows that group Papoy sold an average of 1700 smartphones per day. Meanwhile, an 80-day
sample shows that group Pipoy sold an average of 1800 smartphones per day. The population
standard deviation is 270 for group Papoy and 320 for group Pipoy. At the 0.05 significance
level, can Lisa conclude that the average sales of group Pipoy are greater than group Papoy’s?
Determine the p-value!

PROBLEM 5.3
A vegan pizza advertisement claims that the pizza can help with weight loss. A random sample of 8
influencers recorded their weight before and after consumption, as shown in the table.

Name     Alpha   Bane   Claude   Diggie   Estes   Fanny   Gord   Harley
Before   65      80     64       76       85      54      83     86
After    62      72     74       68       67      42      77     75

At the 0.01 significance level, can we conclude that the vegan pizza effectively helps
weight loss?
MODULE 6

TWO-SAMPLE TESTS OF HYPOTHESIS


By Patricia Angelica

TWO-SAMPLE TEST ABOUT PROPORTIONS

To conduct the test, we assume each sample is large enough that the normal
distribution will serve as a good approximation of the binomial distribution. The test statistic
follows the standard normal distribution. We compute the value of z from the following
formula:

$$z = \frac{p_1 - p_2}{\sqrt{\frac{p_c(1 - p_c)}{n_1} + \frac{p_c(1 - p_c)}{n_2}}}$$

n1 is the number of observations in the first sample.

n2 is the number of observations in the second sample.

p1 is the proportion in the first sample possessing the trait.

p2 is the proportion in the second sample possessing the trait.

pc is the pooled proportion possessing the trait in the combined samples. It is called the
pooled estimate of the population proportion and is computed from the following formula.

$$p_c = \frac{X_1 + X_2}{n_1 + n_2}$$

X1 is the number possessing the trait in the first sample.

X2 is the number possessing the trait in the second sample.

Example:
The null and alternate hypotheses are:

H0: π1 ≤ π2
H1: π1 > π2

A sample of 100 observations from the first population revealed X1 is 70. A sample of 150
observations from the second population revealed X2 is 90. Use the 0.05 significance level to
test the hypothesis.

Step 1:State the decision rule.

Step 2:Compute the pooled proportion.

Step 3:Compute the value of the test statistic.

Step 4:What is your decision regarding the null hypothesis?

Solution:

Step 1: State the decision rule based on the z value

1 − α = 1 − 0.05 = 0.95
0.95 − 0.50 = 0.45
z value = (1.64 + 1.65)/2 = 1.645

Reject H0 if z > 1.645

Step 2:Find the pooled estimate of the population proportion for combined samples

Pc = (70+90)/(100+150) = 0.64

Step 3:Determine the test statistic

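p1 = 70/100 = 0.70     p2 = 90/150 = 0.60

$$z = \frac{0.70 - 0.60}{\sqrt{\frac{0.64 \times 0.36}{100} + \frac{0.64 \times 0.36}{150}}} = 1.61$$

Step 4: Make a decision for H0

Do not reject H0, because 1.61 is less than the critical value of 1.645.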
EQUAL POPULATION STANDARD DEVIATION

The following formula is used to pool the sample variances. Notice that two
factors are involved: the number of observations in each sample and the sample standard
deviations themselves.

$$s_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}$$

The value of t is then computed from:

$$t = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{s_p^2\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}}$$

s1² is the variance (standard deviation squared) of the first sample.

s2² is the variance of the second sample.

X̄1 is the mean of the first sample.

X̄2 is the mean of the second sample.

n1 is the number of observations in the first sample.

n2 is the number of observations in the second sample.

sp² is the pooled estimate of the population variance.

There are three requirements or assumptions for the test:

1. The sampled populations follow the normal distribution.

2. The sampled populations are independent.

3. The standard deviations of the two populations are equal.

Example:

The null and alternate hypotheses are:

H0: μ1 = μ2
H1: μ1 ≠ μ2

A random sample of 10 observations from one population showed a sample mean of 23 and a
sample standard deviation of 4. A random sample of 8 observations from another population showed a
sample mean of 26 and a sample standard deviation of 5. The significance level is 0.05. Is
there a difference between the population means? State: (step 1) the decision rule, (step 2)
compute the pooled estimate of the population variance, (step 3) compute the test statistic,
(step 4) state your decision about the null hypothesis, (step 5) find the p-value.

Solution:

Step 1: State the decision rule based on the t value

df = total no. of items sampled minus the no. of samples = n1 + n2 – 2 = 10 + 8 – 2 = 16

Reject H0 if t > 2.120 or t < -2.120

Step 2: Pool the sample variances

$$s_p^2 = \frac{(10 - 1)(4)^2 + (8 - 1)(5)^2}{10 + 8 - 2} = 19.9375$$

Step 3: Determine the test statistic

$$t = \frac{23 - 26}{\sqrt{19.9375\left(\frac{1}{10} + \frac{1}{8}\right)}} = -1.416$$

Step 4: Make a decision for H0

Do not reject H0, because −1.416 lies between −2.120 and 2.120. There is not enough evidence of a
difference between the population means.

Step 5: Report the p-value

Ignore the negative sign of the t-value, so it will be 1.416. For a two-tailed test with df = 16, the
critical t-value at the 0.10 significance level is 1.746, and at the 0.20 significance level it is
1.337. Because 1.416 lies between 1.337 and 1.746, the p-value is greater than 0.10 and less
than 0.20.
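
The pooled-variance steps above can be sketched in a few lines of Python. The function name `pooled_t` is illustrative only, and the critical value 2.120 is taken from the decision rule in the text rather than computed.

```python
import math


def pooled_t(mean1, s1, n1, mean2, s2, n2):
    """Pooled variance, t statistic and degrees of freedom for two independent samples."""
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / df
    t = (mean1 - mean2) / math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return pooled_var, t, df


pooled_var, t, df = pooled_t(23, 4, 10, 26, 5, 8)

# Two-tailed decision rule from the text: reject H0 when |t| > 2.120.
reject_h0 = abs(t) > 2.120

print(pooled_var, round(t, 3), df, reject_h0)
```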
UNEQUAL POPULATION STANDARD DEVIATION

In the previous section, it was necessary to assume that the populations had equal standard
deviations. If the standard deviations are not equal, we use a statistic very much like the one in the
previous section. The sample standard deviations, s1 and s2, are used in place of the respective
population standard deviations. In addition, the degrees of freedom are adjusted downward by
a rather complex approximation formula. The effect is to reduce the number of degrees of
freedom in the test, which will require a larger value of the test statistic to reject the null
hypothesis.

The formula for the t statistic is:

$$t = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}$$

The degrees of freedom statistic is:

$$df = \frac{\left(\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}\right)^2}{\frac{(s_1^2/n_1)^2}{n_1 - 1} + \frac{(s_2^2/n_2)^2}{n_2 - 1}}$$

Example:

The null and alternate hypotheses are:

H0: μ1 = μ2
H1: μ1 ≠ μ2

A random sample of 15 items from the first population showed a mean of 50 and a standard
deviation of 5. A sample of 12 items for the second population showed a mean of 46 and a
standard deviation of 15. Assume the sample populations do not have equal standard deviations
and use the 0.05 significance level: (step 1) determine the number of degrees of
freedom, (step 2) state the decision rule, (step 3) compute the value of the test statistic, and
(step 4) state your decision about the null hypothesis.

Solution:

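Step 1: Find the df

$$df = \frac{\left(\frac{25}{15} + \frac{225}{12}\right)^2}{\frac{(25/15)^2}{15 - 1} + \frac{(225/12)^2}{12 - 1}} = \frac{416.84}{0.1984 + 31.9602} = 12.96$$

Rounding down gives df = 12.

Step 2: State the decision rule based on the t value

Reject H0 if t > 2.179 or t < -2.179

Step 3: Find the value of the test statistic

$$t = \frac{50 - 46}{\sqrt{\frac{25}{15} + \frac{225}{12}}} = \frac{4}{4.5185} = 0.885$$

Step 4: Make a conclusion

Fail to reject H0, because 0.885 is less than 2.179.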
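
The degrees-of-freedom approximation is easy to get wrong by hand, so a small check may help. This sketch assumes the unequal-variance formulas described in this section; `welch_t` is an illustrative name, and the degrees of freedom are rounded down as in the solution.

```python
import math


def welch_t(mean1, s1, n1, mean2, s2, n2):
    """t statistic and approximate degrees of freedom when variances are unequal."""
    v1, v2 = s1**2 / n1, s2**2 / n2
    t = (mean1 - mean2) / math.sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))
    return t, df


t, df = welch_t(50, 5, 15, 46, 15, 12)
df_rounded = math.floor(df)  # round down before using the t table

# Two-tailed decision rule from the text for df = 12 at alpha = 0.05.
reject_h0 = abs(t) > 2.179

print(round(t, 3), round(df, 2), df_rounded, reject_h0)
```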
REVIEW QUESTIONS

PROBLEM 6.1

A study was conducted to investigate the effectiveness of hypnotism in reducing pain. Results
for randomly selected subjects are shown in the table below. The “before” value is matched
to an “after” value, and the differences are calculated. The differences have a normal
distribution.

Are the sensory measurements, on average, lower after hypnotism? Test at a 5% significance level.

This table shows the before and after values of the data in our sample

PROBLEM 6.2

Two college instructors are interested in whether or not there is any variation in the way they
grade math exams. They each grade the same set of 30 exams. The first instructor’s grades
have a variance of 52.3. The second instructor’s grades have a variance of 89.9. Test the
claim that the first instructor’s variance is smaller. (In most colleges, it is desirable for the
variances of exam grades to be nearly the same among instructors.) The level of significance
is 10%.

PROBLEM 6.3

A random sample of 10 hot drinks from Dispenser A had a mean volume of 203 ml and a
standard deviation (divisor (n − 1)) of 3 ml. A random sample of 15 hot drinks from Dispenser B
gave corresponding values of 206 ml and 5 ml. The amount dispensed by each machine may be
assumed to be normally distributed. Test, at the 5% significance level, the hypothesis that there
is no difference in the variability of the volume dispensed by the two machines.
MODULE 7
ANALYSIS OF VARIANCE
By Ferren Aurelia

ANOVA
The F distribution is used to test whether two samples are from populations
having equal variances, and it is also applied when we want to compare several population
means simultaneously. The simultaneous comparison of several population means is called
Analysis of Variance (ANOVA).

ANOVA Assumptions
Another use of the F distribution is the analysis of variance (ANOVA) technique in which we
compare three or more population means to determine whether they could be equal. To use
ANOVA, we assume the following:
1. The populations follow the normal distribution.
2. The populations have equal standard deviations
3. The populations are independent.
When these conditions are met, F is used as the distribution of the test statistic.

F Distribution
The F distribution is used to test the hypothesis that the variance of one normal population equals
the variance of another normal population. In other words, one application of the F distribution is to
compare two population variances.

Characteristics of the F distribution


1. There is a family of F distributions. A particular member of the family is
determined by two parameters : the degrees of freedom in the numerator and the
degrees of freedom in the denominator.
2. The F distribution is continuous. This means that the value of F can assume an
infinite number of values between zero and positive infinity.
3. The F statistic cannot be negative . The smallest value F can assume is 0.
4. The F distribution is positively skewed. The long tail of the distribution is to the
right- hand side. As the number of degrees of freedom increases in both the numerator
and denominator, the distribution approaches a normal distribution.
5. The F distribution is asymptotic. As the values of F increase, the distribution
approaches the horizontal axis but never touches it.

Comparing Two Population Variances

$$F = \frac{s_1^2}{s_2^2}$$

Where : s² = sample variance

Example :
Lammers Limos offers limousine service from Government Center in downtown Toledo,
Ohio, to Metro Airport in Detroit. Sean Lammers, president of the company, is considering
two routes. One is via U.S. 25 and the other via I-75. He wants to study the time it takes to
drive to the airport using each route and then compare the results. He collected the following
sample data, which is reported in minutes. Using the .10 significance level, is there a
difference in the variation in the driving times for the two routes?
U.S. Route 25    Interstate 75
52               59
67               60
56               61
45               51
70               56
54               63
64               57
-                65

Step 1: State the null hypothesis and the alternate hypothesis. The test is two-tailed because
we are looking for a difference in the variation of the two routes. We are not trying to show
that one route has more variation than the other.
𝐻0: 𝜎1² = 𝜎2²
𝐻1: 𝜎1² ≠ 𝜎2²

Step 2: The significance level is .10 as stated in the problem.


Step 3: Use the F distribution for the test statistic.
Step 4: Look up the F critical value in the distribution table. This is a two-tailed test, so the
significance level must be divided by two: .10 / 2 = .05.
The degrees of freedom (df) in the numerator are n − 1 = 7 − 1 = 6 (U.S. Route 25)

and the degrees of freedom in the denominator are n − 1 = 8 − 1 = 7 (Interstate 75), so the critical value of F is

3.87.

Step 5: Find the variance ratio (F value) for the two samples. First compute each sample standard
deviation, then square it to obtain the variance.

*The larger sample variance becomes the numerator and the smaller becomes the denominator. Thus the
variance ratio (F calculated) is always larger than 1.00. The same ordering applies when determining the
degrees of freedom in the numerator and denominator.*

$$F = \frac{s_1^2}{s_2^2} = \frac{(8.9947)^2}{(4.3753)^2} = \frac{80.90}{19.14} = 4.23$$

F calculated > F table = reject null hypothesis
The F calculated value is 4.23, which is larger than the F critical value (F table) of 3.87. So, we
reject the null hypothesis and conclude that the variation in driving times differs between the two routes.
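
The variance ratio for the Lammers Limos data can be verified with the standard library. This is a minimal sketch; the variable names are illustrative, and the critical value 3.87 comes from the F table as stated in the text.

```python
from statistics import variance

us_25 = [52, 67, 56, 45, 70, 54, 64]
i_75 = [59, 60, 61, 51, 56, 63, 57, 65]

var_us_25 = variance(us_25)  # sample variance, divisor n - 1
var_i_75 = variance(i_75)

# The larger sample variance goes in the numerator, and so do its degrees of freedom.
if var_us_25 >= var_i_75:
    f_value = var_us_25 / var_i_75
    df_num, df_den = len(us_25) - 1, len(i_75) - 1
else:
    f_value = var_i_75 / var_us_25
    df_num, df_den = len(i_75) - 1, len(us_25) - 1

reject_h0 = f_value > 3.87  # F table value for .05 with (6, 7) df

print(round(f_value, 2), df_num, df_den, reject_h0)
```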

ANOVA Testing

One-Way ANOVA
We use one-way ANOVA to compare the means of three or more samples when the data have only 1
independent variable/category (example: comparison by gender, race, color, age, types, etc).
𝐻0: 𝜇1 = 𝜇2 = 𝜇3 = ⋯ = 𝜇𝑘

𝐻1: Not all population means are equal

ANOVA Table

Source of Variation   Sum of Squares   Degrees of Freedom   Mean Square         F
Treatments            SST              k-1                  SST/(k-1) = MST     MST/MSE
Error                 SSE              n-k                  SSE/(n-k) = MSE
Total                 SS total         n-1

SS total (total sum of squares) = ∑ (X − X̅G)²
SSE (sum of squares error) = ∑ (X − X̅c)²
SST (sum of squares treatment) = SS Total – SSE

Where :
X is each sample observation
X̅G is the overall or grand mean
X̅c is the sample mean for treatment c.


Example :

Recently airlines cut services, such as meals and snacks during flights, and started
charging for checked luggage. A group of four carriers hired Brunner Marketing Research
Inc. to survey passengers regarding their level of satisfaction with a recent flight. The survey
included questions on ticketing, boarding, in-flight service, baggage handling, pilot
communication, and so forth. Twenty-five questions offered a range of possible answers :
excellent, good, fair, or poor. A response of excellent was given a score of 4, good a 3, fair a
2, and poor a 1. These responses were then totalled, so the total score was an indication of the
satisfaction with the flight. The greater the score, the higher level of satisfaction with the
service. The highest possible score was 100.

Brunner randomly selected and surveyed passengers from the four airlines. Below is
the sample information. Is there a difference in the mean satisfaction level among the four
airlines? Use the .01 significance level.

Northern   WTA   Pocono   Branson
94         75    70       68
90         68    73       70
85         77    76       72
80         83    78       65
-          88    80       74
-          -     68       65
-          -     65       -

Step 1 : State the hypothesis

𝐻0: 𝜇𝑁 = 𝜇𝑊 = 𝜇𝑃 = 𝜇𝐵

The mean scores are the same for the four airlines.

𝐻1: The mean scores are not all equal

The mean scores are not all the same for the four airlines.
Step 2 : Select the level of significance. We selected the .01 significance level.
Step 3 : Formulate the decision rule.
Degrees of freedom in the numerator = k-1 = 4-1 = 3
Degrees of freedom in the denominator = n-k = 22-4 = 18
F table (0.01; 3; 18) = 5.09. Reject Ho if the computed value of F exceeds 5.09.
Step 4 : Select the sample, perform the calculations, and make a decision.

SS Total = ∑ (X − X̅G)²

               Northern   WTA     Pocono   Branson   Total
               94         75      70       68
               90         68      73       70
               85         77      76       72
               80         83      78       65
               -          88      80       74
               -          -       68       65
               -          -       65       -
Column total   349        391     510      414       1664
n              4          5       7        6         22
Mean           87.25      78.20   72.86    69.00     75.64

Squared deviations from the grand mean, (X − X̅G)² :

Northern   WTA      Pocono   Branson
337.09     0.41     31.81    58.37
206.21     58.37    6.97     31.81
87.61      1.85     0.13     13.25
19.01      54.17    5.57     113.21
-          152.77   19.01    2.69
-          -        58.37    113.21
-          -        113.21   -

SS Total = 1485.10
SSE = ∑ (X − X̅c)²
To compute the term SSE, find the deviation between each observation and its treatment
mean. In the example, the mean of the first treatment (the passengers on Northern Airlines) is
87.25, found by X̅N = 349/4.

The first passenger rated Northern a 94, so (X − X̅N)² = (94 − 87.25)² = 45.5625. The first
passenger in the WTA group responded with a total score of 75, so (X − X̅W)² = (75 − 78.20)² =
10.24. The detail for all the passengers follows.

        Northern   WTA      Pocono   Branson   Total
        45.5625    10.24    8.18     1
        7.5625     104.04   0.02     1
        5.0625     1.44     9.86     9
        52.5625    23.04    26.42    16
        -          96.04    50.98    25
        -          -        23.62    16
        -          -        61.78    -
Total   110.7500   234.80   180.86   68        SSE = 594.41

SST = SS Total – SSE

SST = 1485.10 – 594.41 = 890.69

Insert these values into an ANOVA table and compute the value of F as follows.

Source of Variation   Sum of Squares   Degrees of Freedom   Mean Square   F
Treatments            890.69           3                    296.90        8.99
Error                 594.41           18                   33.02
Total                 1485.10          21

The computed value of F is 8.99, which is greater than the critical value (F table) of 5.09, so
the null hypothesis is rejected.

Step 5 : Interpret the result. We conclude that not all population means are equal.
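
The sums of squares for the airline example can be recomputed from the raw scores. This is a sketch rather than a full ANOVA routine: it uses unrounded means, so SS Total comes out a few hundredths away from the hand calculation, and the critical value 5.09 is taken from the text.

```python
groups = {
    "Northern": [94, 90, 85, 80],
    "WTA": [75, 68, 77, 83, 88],
    "Pocono": [70, 73, 76, 78, 80, 68, 65],
    "Branson": [68, 70, 72, 65, 74, 65],
}

all_scores = [x for scores in groups.values() for x in scores]
n, k = len(all_scores), len(groups)
grand_mean = sum(all_scores) / n

# SS total: squared deviations from the grand mean.
ss_total = sum((x - grand_mean) ** 2 for x in all_scores)

# SSE: squared deviations from each treatment (airline) mean.
sse = sum(
    (x - sum(scores) / len(scores)) ** 2
    for scores in groups.values()
    for x in scores
)

sst = ss_total - sse
mst = sst / (k - 1)
mse = sse / (n - k)
f_value = mst / mse
reject_h0 = f_value > 5.09  # F table (0.01; 3; 18)

print(round(sst, 2), round(sse, 2), round(f_value, 2), reject_h0)
```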

Example :

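Let us compute the 95% confidence interval for the difference between the mean scores of passengers on
Northern and Branson. The mean of Northern is 87.25 and the mean of Branson is 69.00, the sample sizes
are 4 and 6, and SSE is 594.4, so MSE = 594.4 / 18 = 33.0. With 18 degrees of freedom, t = 2.101.

$$(\bar{X}_1 - \bar{X}_2) \pm t\sqrt{MSE\left(\frac{1}{n_1} + \frac{1}{n_2}\right)} = (87.25 - 69.00) \pm 2.101\sqrt{33.0\left(\frac{1}{4} + \frac{1}{6}\right)} = 18.25 \pm 7.79$$

The interval runs from 10.46 to 26.04. Because it does not include zero, we conclude that the two
means differ.
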
Two-Way ANOVA

Example :

WARTA, the Warren Area Regional Transit Authority, is expanding bus service from the suburb
of Starbrick into the central business district of Warren. There are four routes being considered
from Starbrick to downtown Warren : (1) via U.S. 6, (2) via the West End, (3) via the Hickory
Street Bridge, and (4) via Route 59. WARTA conducted several tests to determine whether there
was a difference in the mean travel times along the four routes. Because there will be many
different drivers, the test was set up so each driver drove along each of the four routes. Below is
the travel time, in minutes, for each driver-route combination.

Travel Time from Starbrick to Warren (minutes)

Driver     U.S. 6   West End   Hickory St.   Rte. 59
Deans      18       17         21            22
Snaverly   16       23         23            22
Ormson     21       21         26            22
Zollaco    23       22         29            25
Fillbeck   25       24         28            28

At the .05 significance level, is there a difference in the mean travel time along the four
routes? If we remove the effect of the drivers, is there a difference in the mean travel time?

Step 1 : State the hypothesis

1. Ho : The mean travel time is the same for all routes from Starbrick to Warren (µ1 = µ2 = µ3
= µ4).
H1 : The mean travel time is not the same for all routes from Starbrick to Warren
2. Ho : The driver mean travel time is the same (µD = µS = µO = µZ = µF)
H1 : The driver mean travel time is not the same

Step 2 : Select the level of significance. We selected the .05 significance level.
Step 3 : Formulate the decision rule.

1. Test the hypothesis concerning the treatment means. There are (k-1) = (4-1) = 3
degrees of freedom in the numerator and (b-1) (k-1) = (5-1) (4-1) = 12 degrees of
freedom in the denominator. Using the .05 significance level, the critical value of F is
3.49. The null hypothesis that the mean times for the four routes are the same is
rejected if the F ratio exceeds 3.49.
2. Test the hypothesis concerning the blocks means. The degrees of freedom in the
numerator for blocks are (b-1) = (5-1) = 4. The degrees of freedom for the
denominator are the same as before : (b-1) (k-1) = (5-1) (4-1) = 12. The null
hypothesis that the driver mean travel time are the same is rejected if the F ratio
exceeds 3.26.

Step 4 : Select the sample, perform the calculations, and make a decision.

SS Total = ∑ (X − X̅G)²

= (18 − 22.8)² + (17 − 22.8)² + … + (28 − 22.8)²

= 229.2

SSB = k∑ (X̅b − X̅G)²

= 4(19.5 − 22.8)² + 4(21.0 − 22.8)² + 4(22.5 − 22.8)² + 4(24.75 − 22.8)² + 4(26.25 − 22.8)²

= 119.7

Travel Time from Starbrick to Warren (minutes)

Driver     U.S. 6   West End   Hickory St.   Rte. 59   Driver Sums   Driver Means
Deans      18       17         21            22        78            19.50
Snaverly   16       23         23            22        84            21.00
Ormson     21       21         26            22        90            22.50
Zollaco    23       22         29            25        99            24.75
Fillbeck   25       24         28            28        105           26.25

SST = b∑ (X̅t − X̅G)²

= 5(20.6 − 22.8)² + 5(21.4 − 22.8)² + 5(25.4 − 22.8)² + 5(23.8 − 22.8)²

= 72.8

Travel Time from Starbrick to Warren (minutes)

Driver              U.S. 6   West End   Hickory St.   Rte. 59
Deans               18       17         21            22
Snaverly            16       23         23            22
Ormson              21       21         26            22
Zollaco             23       22         29            25
Fillbeck            25       24         28            28
Travel Time Sums    103      107        127           119
Travel Time Means   20.6     21.4       25.4          23.8

SSE = SS Total – SST – SSB = 229.2 – 72.8 – 119.7

= 36.7

Source of Variation   Sum of Squares   Degrees of Freedom   Mean Square   F
Treatments            72.8             3                    24.27         7.93
Blocks                119.7            4                    29.93         9.78
Error                 36.7             12                   3.06
Total                 229.2            19

Step 5 : Interpret the result.

Treatment : The null hypothesis is rejected. The mean travel time is not the same for all routes
from Starbrick to Warren. F calculated > F table (7.93 > 3.49)

Block : The null hypothesis is rejected. The driver mean travel time is not the same. F
calculated > F table (9.78 > 3.26)
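
The two F ratios in the WARTA example can be reproduced from the travel-time table. This is a sketch assuming one observation per driver-route cell, as in the example; the dictionary layout and variable names are illustrative, and the critical values 3.49 and 3.26 are taken from the text.

```python
# Rows are drivers (blocks); columns are routes (treatments) in the order
# U.S. 6, West End, Hickory St., Rte. 59.
times = {
    "Deans": [18, 17, 21, 22],
    "Snaverly": [16, 23, 23, 22],
    "Ormson": [21, 21, 26, 22],
    "Zollaco": [23, 22, 29, 25],
    "Fillbeck": [25, 24, 28, 28],
}

rows = list(times.values())
b, k = len(rows), len(rows[0])  # number of blocks, number of treatments
values = [x for row in rows for x in row]
grand_mean = sum(values) / (b * k)

route_means = [sum(row[j] for row in rows) / b for j in range(k)]
driver_means = [sum(row) / k for row in rows]

ss_total = sum((x - grand_mean) ** 2 for x in values)
sst = b * sum((m - grand_mean) ** 2 for m in route_means)   # treatments
ssb = k * sum((m - grand_mean) ** 2 for m in driver_means)  # blocks
sse = ss_total - sst - ssb

mse = sse / ((b - 1) * (k - 1))
f_treatment = (sst / (k - 1)) / mse
f_block = (ssb / (b - 1)) / mse

print(round(f_treatment, 2), f_treatment > 3.49)
print(round(f_block, 2), f_block > 3.26)
```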
REVIEW QUESTIONS

PROBLEM 7.1
A study was conducted to determine whether there are differences in the amount of instant
coffee consumed. The data below show the amount of instant coffee that households drink during
a month. Four types of instant coffee are compared:

Nescafe   Excelso   Luwak White Coffee   Indocafe
9         11        14                   11
13        15        16                   8
12        17        5                    17
10        7         13                   10

At the 0.05 significance level, is there a difference in the mean amount of instant coffee
consumed by brand?

PROBLEM 7.2

Sainz Company is a T-shirt manufacturer that sells three clothing sizes: small,
medium, and large. Sales, in millions of dollars, for the past 5 months are given in the
following table. Using the .01 significance level, test whether mean sales differ among
the three clothing sizes and among the months.

Sales ($ million)

Month      Small   Medium   Large
January    12      8        9
February   6       12       4
March      10      14       11
April      8       5        13
May        7       10       6
MODULE 8
CORRELATION AND LINEAR REGRESSION
By Ferren Aurelia

Correlation Analysis
A group of techniques to measure the relationship between two variables. It provides a
quantitative measure of the strength of the relationship between two variables. For example,
whether the stocks of two airlines rise and fall in any related manner.

Correlation Coefficient
Describes the strength of the relationship between two sets of interval-scaled or ratio-scaled
variables. It ranges from -1 up to and including +1.

$$r = \frac{\sum (x - \bar{x})(y - \bar{y})}{(n - 1)\, s_x s_y}$$

where :
𝑠𝑦 = standard deviation y
𝑠𝑥 = standard deviation x
𝑛 = number of data
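
As an illustration of the correlation coefficient, the sketch below computes r from the sum of cross-deviations divided by (n − 1)·sx·sy. The data in `x` and `y` are hypothetical values made up for this illustration and do not come from the module; `correlation` is an illustrative name.

```python
from statistics import mean, stdev


def correlation(x, y):
    """Sample correlation coefficient r for paired observations."""
    n = len(x)
    x_bar, y_bar = mean(x), mean(y)
    cross = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    return cross / ((n - 1) * stdev(x) * stdev(y))


# Hypothetical paired data, for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

r = correlation(x, y)
print(round(r, 4))
```

A value of r close to +1 or −1 indicates a strong linear relationship, while a value near 0 indicates a weak one.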
Example :
Suppose that at the Heart Hospital 7 doctors examined patients over one month and wrote
prescriptions for the medicine the patients received. We
obtained the following results and want to know if there is any relationship between the
measured variables.
REVIEW QUESTIONS

PROBLEM 8.1
The manufacturer of Car Tire wants to study the relationship between the number of months
since the tire was purchased and the length of time the car tire was used last week. Determine
the correlation coefficient.

Person    Months Owned   Hours Used
Ollie     8              21
Carlos    4              14
Charles   10             17
George    6              12
Paul      9              10
Bianca    5              11
Leclerc   12             32

PROBLEM 8.2
The city council of Pixie Hollow is considering increasing the number of police in an effort to
reduce crime. Before making a final decision, the council asked the chief of police to survey
other cities of similar size to determine the relationship between the number of police and the
number of crimes reported. The chief gathered the following sample information.

City          Police   Number of Crimes
Thneedville   16       6
Axiom         7        11
Grytt         11       8
Auburn        19       10
Blyworth      23       7
Bartons       8        5

Use the data above to compute a correlation coefficient (r) to determine the correlation between
police and number of crimes, and conduct a test of hypothesis to determine if it is reasonable to
conclude that the population correlation is greater than zero. Use the 0.05 significance level.
PROBLEM 8.3
The production department of Astro International wants to explore the relationship between
the number of employees who assemble a subassembly and the number produced. As an
experiment, three employees were assigned to assemble the subassemblies. They produced 20
during a one-hour period. Then five employees assembled them. They produced 30 during a
one-hour period. The complete set of paired observations follows.
Number of Assemblers   One-Hour Production (Units)
3                      20
5                      30
1                      5
7                      45
2                      15
6                      35

a. Compute the correlation coefficient between the two variables. At the 0.05 significance level,
conduct a test of hypothesis to determine if the population correlation is greater than zero.
b. Determine the regression equation.
MODULE 9
ESTIMATING Y VALUE
By Ferren Aurelia

Testing the Significance of the Slope


The next step is to analyze the regression equation by conducting a test of hypothesis to see if
the slope of the regression line is different from zero.

STEP 1 : HYPOTHESIS
Two-tailed test :    One-tailed test :    One-tailed test :
𝐻0 : 𝛽 = 0           𝐻0 : 𝛽 ≥ 0           𝐻0 : 𝛽 ≤ 0
𝐻1 : 𝛽 ≠ 0           𝐻1 : 𝛽 < 0           𝐻1 : 𝛽 > 0

𝛽 = slope of the population regression line

STEP 2 : FIND THE T-DISTRIBUTION

Using df = n – k – 1
Because k = 1, df = n – 2

STEP 3 : CALCULATE THE T-VALUE

$$t = \frac{b - 0}{s_b}$$

b = Estimate of the Regression Line’s Slope

𝑆𝑏 = Standard Error of the Slope Estimate

STEP 4 : COMPARE T-VALUE WITH T-DISTRIBUTION

If the t-value is not in the acceptable area, then the null hypothesis (Ho) is rejected in favor of
the alternative hypothesis (H1). Hence, the independent variable is an aid in
predicting the dependent variable.

STEP 5 : CONCLUSION

Example :
Using the t-test, determine whether more sales calls result in more copier sales (use a
5% significance level).
STEP 1 : HYPOTHESIS
𝐻𝑜 ∶ 𝛽 ≤ 0
𝐻1 ∶ 𝛽 > 0

STEP 2 : FIND THE T-DISTRIBUTION

Observations (n) = 10
df = 10 – 2 = 8 and the test is one-tailed, so α = 0.05

then from the t-table, the critical value is t = 1.860

So, the acceptable area for Ho is t < +1.860

STEP 3 : CALCULATE THE T-VALUE

From the table we know that :

b = 1.18421
Sb = 0.35914

$$t = \frac{b - 0}{s_b} = \frac{1.18421}{0.35914} = 3.297$$

STEP 4 : COMPARE T-VALUE WITH THE T-DISTRIBUTION

From Step 3, we know that the t-value is 3.29734, while the acceptable area for the null
hypothesis is t < +1.860. Because the t-value is not in the acceptable
area, the null hypothesis is rejected in favor of the alternative hypothesis (H1).

STEP 5 : CONCLUSION
At the 5% significance level, we conclude that more calls made by sales representatives lead to
more copier sales. There is a positive relationship between sales calls and copier
sales.
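
The slope, its standard error, and the t-value can be recomputed from the raw sales-call data of the copier example in this module. The sketch below uses the usual least-squares formulas; the variable names are illustrative, and the critical value 1.860 is taken from the t table as in the text.

```python
import math

# Sales calls (X) and copiers sold (Y) for the ten sales representatives.
calls = [20, 40, 20, 30, 10, 10, 20, 20, 20, 30]
sales = [30, 60, 40, 60, 30, 40, 40, 50, 30, 70]

n = len(calls)
x_bar = sum(calls) / n
y_bar = sum(sales) / n

sxx = sum((x - x_bar) ** 2 for x in calls)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(calls, sales))

b = sxy / sxx                # slope of the regression line
a = y_bar - b * x_bar        # intercept
sse = sum((y - (a + b * x)) ** 2 for x, y in zip(calls, sales))
s_yx = math.sqrt(sse / (n - 2))   # standard error of estimate
s_b = s_yx / math.sqrt(sxx)       # standard error of the slope

t = (b - 0) / s_b
reject_h0 = t > 1.860  # one-tailed, alpha = 0.05, df = 8

print(round(b, 5), round(s_b, 5), round(t, 3), reject_h0)
```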

Evaluating a Regression Equation’s Ability to Predict


A. STANDARD ERROR OF ESTIMATE
A measure of the dispersion, or scatter, of the observed values around the line of
regression for a given value of X.

$$s_{y \cdot x} = \sqrt{\frac{\sum (Y - \hat{Y})^2}{n - 2}}$$

*Where :

Σ(Y − Ŷ)² = SSE
n = number of observations

Example :
STEP 1 : FIND THE REGRESSION LINE
From the data we know that :
a = 18.9474
b = 1.18421
So, 𝑌̂ = 18.9474 + 1.18421 𝑋

STEP 2 : FIND THE SUM OF RESIDUAL


Sales            Sales       Copier      Estimated
Representative   Calls (X)   Sales (Y)   Sales (Ŷ)   (Y − Ŷ)     (Y − Ŷ)²
Tom              20          30          42.6316     -12.6316    159.5573
Jeff             40          60          66.3158      -6.3158     39.8893
Brian            20          40          42.6316      -2.6316      6.9253
Greg             30          60          54.4737       5.5263     30.54
Susan            10          30          30.7895      -0.7895      0.6233
Carlos           10          40          30.7895       9.2105     84.8333
Rich             20          40          42.6316      -2.6316      6.9253
Mike             20          50          42.6316       7.3684     54.2933
Mark             20          30          42.6316     -12.6316    159.5573
Soni             30          70          54.4737      15.5263    241.066
Total                                                            784.211

∑(Y − Ŷ)² = 784.211

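STEP 3 : CALCULATE THE STANDARD ERROR OF ESTIMATE

Sy.x = √[ 784.211 / (10 − 2) ] = √98.026 = 9.901

STEP 4 : CONCLUSION

If the standard error of estimate is small, the observed values lie close to the regression
line and the line can be used to predict Y with little error. If the standard error of estimate
is large, predictions of Y from the line will be imprecise.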
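
As a rough check of Steps 1 to 3, the residuals and the standard error of estimate can be recomputed from the ten observations in the table. The list names below are assumptions for this sketch; the coefficients are the rounded values given in the text.

```python
import math

calls = [20, 40, 20, 30, 10, 10, 20, 20, 20, 30]   # X, sales calls
sales = [30, 60, 40, 60, 30, 40, 40, 50, 30, 70]   # Y, copiers sold
a, b = 18.9474, 1.18421                            # regression line from the text

# Estimated sales for each representative, then the sum of squared residuals.
predicted = [a + b * x for x in calls]
sse = sum((y - y_hat) ** 2 for y, y_hat in zip(sales, predicted))

# Standard error of estimate uses n - 2 degrees of freedom.
std_error = math.sqrt(sse / (len(calls) - 2))

print(f"SSE = {sse:.3f}, standard error of estimate = {std_error:.3f}")
```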
B. COEFFICIENT OF DETERMINATION
The proportion of the total variation in the dependent variable Y that is explained, or
accounted for, by the variation in the independent variable X.
The coefficient of determination is the correlation coefficient squared (r²).
In Picture 9.1, the correlation coefficient is shown as Multiple R = 0.759. If we square r,
we get r² = 0.759² = 0.576. To interpret the coefficient of determination, we convert it
to a percentage, so 0.576 x 100% = 57.6%.
The closer the coefficient of determination is to 100%, the closer the predictions are to
perfect. The conclusion here is that 57.6% of the variation in the
number of copiers sold is explained, or accounted for, by the variation in the number of sales
calls.
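
The same value can be obtained from the sums of squares, using the standard identity r² = 1 − SSE / SS total. The sketch below assumes the SSE of 784.211 from the standard error example and recomputes the total variation from the sales data; the variable names are illustrative.

```python
sales = [30, 60, 40, 60, 30, 40, 40, 50, 30, 70]   # Y, copiers sold
sse = 784.211                                      # unexplained variation, from the SSE table

# Total variation of Y around its mean.
mean_sales = sum(sales) / len(sales)
ss_total = sum((y - mean_sales) ** 2 for y in sales)

r_squared = 1 - sse / ss_total    # share of the variation explained by X
r = r_squared ** 0.5              # positive root, because the slope is positive

print(f"SS total = {ss_total:.0f}, r^2 = {r_squared:.3f}, r = {r:.3f}")
```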

C. RELATIONSHIPS AMONG THE CORRELATION COEFFICIENT, THE COEFFICIENT OF
DETERMINATION, AND THE STANDARD ERROR OF ESTIMATE
When the data are not shown in Excel:
Sales            Sales       Copier
Representative   Calls (X)   Sales (Y)   (X − X̄)   (X − X̄)²
Rich             20          40          -2         4
Mike             20          50          -2         4
Mark             20          30          -2         4
Soni             30          70           8         64
Total                                     0         760

STEP 2 : CREATE A REGRESSION LINE


From Picture 9.1 we know that the regression line is Ŷ = 18.9474 + 1.18421 X. Suppose a
sales representative makes 25 calls:
Ŷ = 18.9474 + 1.18421 (25) = 48.5526

STEP 3 : FIND THE T VALUE


There are 10 observations, so df = 10 – 2 = 8. With a 95% level of confidence, the
significance level is α = 1 – 95% = 5%. Remember that a confidence interval is always
two-sided, so α/2 = 0.025 and the value of t is t0.025,8 = 2.306

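STEP 4 : CALCULATE THE CONFIDENCE INTERVAL
From the previous sections, we know that Sy.x = 9.901, n = 10, X̄ = 22 and Σ(X − X̄)² = 760, so :

Ŷ ± t · Sy.x · √[ 1/n + (X − X̄)² / Σ(X − X̄)² ]
= 48.5526 ± 2.306 (9.901) √[ 1/10 + (25 − 22)² / 760 ]
= 48.5526 ± 7.6356

STEP 5 : CONCLUSION
If sales representatives make 25 calls, the expected sales are 48.5526 copiers, and with 95%
confidence the mean sales will range from 40.9170 to 56.1882 copiers.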
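
The interval can be reproduced with a short calculation. The sketch assumes the standard confidence interval formula for the mean of Y at a given X, Ŷ ± t · Sy.x · √[1/n + (X − X̄)² / Σ(X − X̄)²], with Sy.x = 9.901 and X̄ = 22 taken from the copier data; all variable names are illustrative.

```python
import math

a, b = 18.9474, 1.18421    # regression line from Picture 9.1
n = 10                     # number of observations
x_mean = 22.0              # mean number of sales calls
sum_sq_x = 760.0           # sum of (X - mean of X)^2 from the table
s_yx = 9.901               # standard error of estimate
t = 2.306                  # t(0.025, df = 8) from the t-table

x0 = 25                    # number of calls we want to predict for
y_hat = a + b * x0
margin = t * s_yx * math.sqrt(1 / n + (x0 - x_mean) ** 2 / sum_sq_x)
lower, upper = y_hat - margin, y_hat + margin

print(f"{y_hat:.4f} +/- {margin:.4f} -> ({lower:.4f}, {upper:.4f})")
```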
REVIEW QUESTIONS
PROBLEM 9.1
A recent article in Economic Times Magazine listed the “Best Start-up Company.” We are
interested in the current results of the companies’ sales and earnings. A random sample of 10
companies was selected and the sales and earnings, in millions of dollars, are reported below.

           Sales          Earnings                  Sales          Earnings
Company    ($ millions)   ($ millions)   Company    ($ millions)   ($ millions)
A          $65.4          $8             F          $26.3          $3.8
B          $14.7          $3.5           G          $15.9          $3.2
C          $37.2          $5.1           H          $10.2          $2.5
D          $29.4          $4.3           I          $44.7          $6.7
E          $58.1          $6.2           J          $27.9          $5.3

a. Conduct a test of hypothesis to show whether there is a relationship between sales and
earnings. Show that the slope of the regression is different from zero.
b. Determine the coefficient of determination. Interpret this value.
c. Determine the standard error of estimate. About 95 percent of the residuals will be
between what two values?

PROBLEM 9.2
A nutritionist performed a regression analysis of the relationship between people’s lifespan and
their lifestyle. The regression analysis is lifespan = 11.04 + 0.9372385 (lifestyle). Some
additional output is:

Predictor    Coef        SE Coef     T      P
Constant     11.04       8.465       1.30   0.221
Lifestyle    0.937285    0.129317    7.25   2.766

Analysis of Variance
Source           DF   SS     MS     F       P
Regression        1   1539   1539   52.52   2.766
Residual Error   10    293     29
Total            11   1832

a. How many people were in the sample?


b. Determine the standard error of estimate. About 95 percent of the residuals will be
between what two values?
c. Determine the coefficient of determination.
d. At the 0.05 significance level, does the evidence suggest there is a positive
association between people’s lifespan and their lifestyle?

PROBLEM 9.3
Norris Estate, a real estate company, is planning to sell 8 houses. The prices and sizes of
the houses are listed below:
Prices ($ million) Sizes
85 110
55 85
65 100
50 85
125 120
100 90
80 95
40 65

a. Determine the regression equation.


b. Is it reasonable to conclude that there is a positive relationship between price and size of
the house?
c. Determine the standard error of estimate.
d. Is it possible that the population slope 𝛽 is equal to zero? Prove it with a hypothesis test!
MODULE 10

MULTIPLE REGRESSION ANALYSIS

By Ivana Arni Wijaya

Multiple regression analysis is a statistical tool that uses a mathematical model to
predict a dependent variable from two or more independent variables (or from at least one
nonlinear predictor).

y = β0 + β1X1 + β2X2 + ... + βkXk + e

y = value of the dependent variable (response variable)

X1,2,...,k = values of the independent variables

β0 = regression constant

β1,2,...,k = partial regression coefficients for independent variables 1, 2, ..., k

k = number of independent variables

e = the error of prediction

The regression constant (β0) and the partial regression coefficients (β1,2,...,k) are population
values that are unknown. They are estimated by b0 and b1,2,...,k using sample information.
Estimating y with sample information can be seen below using the Model Fit.

ŷ = b0 + b1X1 + b2X2 + ... + bkXk
THE MODEL FIT

The formulas for the multiple regression coefficients are derived with calculus by
minimizing the sum of squares of error for the regression model. This results in k + 1
equations with k + 1 unknowns (b0 and the k values of bi). Solving these equations by hand
is time-consuming, so in practice researchers use a statistical software
package.
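
To make the "k + 1 equations with k + 1 unknowns" concrete, here is a minimal sketch that builds the normal equations and solves them by Gauss-Jordan elimination. It is not the routine Excel uses; the function name and the small data set are hypothetical, chosen so that y = 1 + 2·X1 + 3·X2 holds exactly.

```python
def fit_multiple_regression(X, y):
    """Return [b0, b1, ..., bk] by solving the normal equations (A'A) b = A'y."""
    rows = [[1.0] + [float(v) for v in r] for r in X]   # prepend the intercept column
    m = len(rows[0])                                     # k + 1 unknowns
    # Augmented matrix [A'A | A'y]: one normal equation per row.
    aug = [[sum(r[i] * r[j] for r in rows) for j in range(m)]
           + [sum(r[i] * yi for r, yi in zip(rows, y))]
           for i in range(m)]
    for col in range(m):
        # Partial pivoting keeps the elimination numerically stable.
        pivot = max(range(col, m), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(m):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [aug[i][m] / aug[i][i] for i in range(m)]

# Hypothetical data generated from y = 1 + 2*X1 + 3*X2 (no error term).
X = [[1, 2], [2, 1], [3, 4], [4, 3], [5, 5]]
y = [9, 8, 19, 18, 26]
coefficients = fit_multiple_regression(X, y)
print([round(c, 6) for c in coefficients])
```

With real data the error term is not zero, so the fitted coefficients minimize the sum of squared errors rather than reproducing y exactly.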
Example:

The data below show the electability of 10 candidates for the next presidential election.
Determine the multiple regression equation. What is the estimated winning index of a candidate
whose track record index is 15, capability index is 5, and leadership index is 10?

Name          Track Record   Capability   Leadership   Winning
              Index          Index        Index        Index (%)
Jokowi        89             80           75           50
Megawati      50             60           19           13
Ridwan        33             20           26           10.5
Susi          32             25           83           34
Prabowo       62             23           47           33
Anies         37             50           65           45
Fahri         21             43           21           20
Gatot         29             64           76           43
Tito          50             74           88           22
Sri Mulyani   77             83           65           39

There are 3 independent variables:

1. X1 = Track Record Index
2. X2 = Capability Index
3. X3 = Leadership Index
Using the regression tool in Excel (b0 is labeled “Intercept” in the Excel output):

Ŷ = b0 + b1X1 + b2X2 + b3X3

Ŷ = 4.908 + 0.158X1 + 0.006X2 + 0.321X3

We can now estimate the winning index of a candidate whose track record index is 15,
capability index is 5, and leadership index is 10.

By substituting the values for the independent variables:

Ŷ = 4.908 + 0.158(15) + 0.006(5) + 0.321(10)

Ŷ = 10.518
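
The substitution step can be written as a small calculation. The dictionary keys below are hypothetical labels for the three indices; the coefficients are the rounded values from the regression equation in the text.

```python
# Estimated regression coefficients from the Excel output in the text.
coefficients = {
    "intercept": 4.908,
    "track_record": 0.158,
    "capability": 0.006,
    "leadership": 0.321,
}

# Index values of the candidate we want to predict for.
candidate = {"track_record": 15, "capability": 5, "leadership": 10}

# Y-hat = b0 + b1*X1 + b2*X2 + b3*X3
y_hat = coefficients["intercept"] + sum(
    coefficients[name] * value for name, value in candidate.items()
)
print(f"Estimated winning index: {y_hat:.3f}%")
```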

INFERENCES IN MULTIPLE REGRESSION ANALYSIS

I. Global Test: Testing the Multiple Regression Model

STEP 1: ESTABLISH THE HYPOTHESIS

H0 : β1 = β2 = ... = βk = 0

H1 : Not all the βi’s are 0

STEP 2: CALCULATE THE DEGREES OF FREEDOM

dfreg = k (number of independent variables)

dferr = N – k – 1

Then use the F distribution table to determine the critical value.

Assume a 95% level of confidence (α = 0.05). Referring to the previous example, N = 10:

dfreg = k [number of independent variables] = 3

dferr = N – k – 1 = 10 – 3 – 1 = 6

Therefore, the critical value from the F distribution table is F0.05,3,6 = 4.76

STEP 3: COMPARE WITH THE OBSERVED F VALUE

Critical F value : 4.76

Observed F value (from the ANOVA table of the Excel output) : 2.23

If the F-statistic is not provided in the output, it can be calculated with the formula below.

F = (SSR / k) / (SSE / (N – k – 1)) = MSR / MSE

The p-value reported next to F in the ANOVA table is the probability, assuming the null
hypothesis is true, of obtaining an F value at least as large as the one observed. A small
p-value is evidence against the null hypothesis. In this case, the F-statistic calculated from
the formula and the F value in the ANOVA table are the same number, 2.23.

STEP 4: STATISTICAL CONCLUSION

If Fobserved > Fα (critical value), reject H0

If P-value < α, reject H0

Therefore, we do not reject H0, because the observed F is smaller than the critical F value
(2.23 < 4.76).

If we do not reject the null hypothesis, we are stating that the regression model has no
significant predictability for the dependent variable. (Rejecting it would mean that at least
one of the independent variables adds significant predictability for y.)
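
A minimal sketch of the global test follows. The sums of squares passed to the function are hypothetical, because the ANOVA output of the example is not reproduced in the text; only N = 10, k = 3 and the critical value 4.76 come from the example above.

```python
def global_f_test(ssr, sse, n, k, f_critical):
    """Return (observed F, reject H0?) for the global test of a regression model."""
    msr = ssr / k                 # mean square regression, df = k
    mse = sse / (n - k - 1)       # mean square error, df = n - k - 1
    f_observed = msr / mse
    return f_observed, f_observed > f_critical

# Hypothetical sums of squares; N, k and the critical value follow the example.
f_observed, reject_h0 = global_f_test(ssr=30.0, sse=60.0, n=10, k=3, f_critical=4.76)
print(f"F = {f_observed:.2f}, reject H0: {reject_h0}")
```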

To evaluate which independent variable is the best predictor, we evaluate the individual
regression coefficients below.

II. Evaluating Individual Regression Coefficients

STEP 1: ESTABLISH THE HYPOTHESIS

For Variable 1:     For Variable 2:     For Variable 3:

H0 : β1 = 0         H0 : β2 = 0         H0 : β3 = 0

H1 : β1 ≠ 0         H1 : β2 ≠ 0         H1 : β3 ≠ 0

STEP 2: CALCULATE THE DEGREES OF FREEDOM

Referring to the previous example, df = N – k – 1 = 10 – 3 – 1 = 6.
Use the t distribution table to determine the critical value (tα/2;N-k-1).

Assume a 95% level of confidence and a two-tailed test:

α/2 = 0.05/2 = 0.025

The critical t value is t0.025,6 = ±2.447


STEP 3: COMPARE WITH THE OBSERVED T-VALUE FOR EACH REGRESSION COEFFICIENT

tobserved (t Stat in the Excel output):

b1 : t = 0.7476

b2 : t = 0.0275

b3 : t = 1.9978

If the t-statistic is not provided in the output, we can calculate it with the formula below.

t = (bk – 0) / Sbk

bk = the k-th regression coefficient

Sbk = standard error (standard deviation of the sampling distribution) of the k-th regression
coefficient

Each t-statistic is compared with the critical t value. This tests the independent
variables individually to determine whether the net regression coefficients differ from zero.

STEP 4: STATISTICAL CONCLUSION

If |tobserved| > |tα/2|, reject H0

If |tobserved| ≤ |tα/2|, do not reject H0

All three observed t values (0.7476, 0.0275 and 1.9978) are smaller than the critical value
2.447, so the null hypothesis is not rejected for any of the coefficients. In other words, none
of the three indices is a significant individual predictor of the winning index at the 5%
significance level, which agrees with the result of the global test. Researchers would
normally consider dropping the variable with the smallest |t| (X2, capability index) first and
rerunning the analysis.
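
The decision rule for the individual coefficients can be sketched as below. The t values and the critical value are copied from the example; the dictionary labels are illustrative assumptions.

```python
t_critical = 2.447    # t(0.025, df = 6) from the t-table

# Observed t statistics for the three coefficients, as listed in Step 3.
t_observed = {
    "X1 (track record)": 0.7476,
    "X2 (capability)": 0.0275,
    "X3 (leadership)": 1.9978,
}

# Two-tailed rule: reject H0 only when |t| exceeds the critical value.
significant = {name: abs(t) > t_critical for name, t in t_observed.items()}

for name, is_significant in significant.items():
    decision = "reject H0" if is_significant else "do not reject H0"
    print(f"{name}: t = {t_observed[name]:.4f} -> {decision}")
```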
REVIEW QUESTIONS

PROBLEM 10.1

Consider the Anova Table that follows :

Source           DF   SS        MS       F

Regression        2    55.804   27.902   5.570

Residual Error   50   250.450    5.009

Total            52   306.254    5.890

a. Determine the standard error of estimate.


b. Determine the coefficient of multiple determination and interpret this value.
c. Determine the coefficient of multiple determination, adjusted for the degrees of freedom.

PROBLEM 10.2

The following regression output was obtained from a study of botanical garden firms. The
dependent variable is the total amount of the fees in millions of dollars.

Predictor   Coefficient   SE Coefficient   t        P-value

Constant     7.987        2.967              2.69   0.01

X1           0.122        0.031              3.94   0

X2          -1.22         0.053            -23.02   0.028

X3          -0.063        0.03              -2.1    0.114

X4           0.523        0.142              3.68   0.001

X5          -0.065        0.04              -1.63   0.112


Analysis of Variance

Source           DF   SS        MS      F       P

Regression        5   3710      742     12.89   0

Residual Error   46   2647.38   57.55   -       -

Total            51   6357.38

X1 is the number of gardener employed by the company

X2 is the number of scientist employed by the company

X3 is the number of years involved with botanical care projects

X4 is the number of states in which the firm operates

X5 is the percent of the firm’s work that is botanical care-related

a. Write out the regression equation.
b. How large is the sample? How many independent variables are there? How many
dependent variables are there?
c. Conduct a global test of hypothesis to see if any of the regression coefficients could be
different from 0. Use the 0.05 significance level. What is your conclusion?
d. Conduct a test of hypothesis for each independent variable. Use the 0.05 significance
level. Which variable would you consider eliminating first?
e. Outline a strategy for deleting independent variables in this case.

PROBLEM 10.3
Performance on the new menu is designated Y.

The equation is: Y’ = 11.6 + 0.4x1 + 0.286x2 + 0.112x3 + 0.002x4, where :

● x1 = length of time an employee has been in the industry
● x2 = job task test score
● x3 = prior on-the-job rating
● x4 = price

Answer the following questions :

a. What is this equation called?
b. How many dependent variables and independent variables are there?
c. What is the number 0.286 called?
d. As the price increases by one thousand dollars, how much does the estimated performance on
the new menu increase?
e. Jimin applied for a job at a cafe. He has been in the business for 6 years and scored 300
on the job task test. Jimin’s prior on-the-job performance rating is 98, and the price right
now is 4000 dollars. Estimate Jimin’s performance on the new menu.
MODULE 11
NONPARAMETRIC METHODS: GOODNESS-OF-FIT TESTS

By Tiffany

A goodness-of-fit test is used, in general, to compare an observed frequency distribution to an
expected frequency distribution for variables measured on a nominal or ordinal scale.

1. Hypothesis Test of Equal Expected Frequencies

A six-sided die is rolled 30 times and the number 1 through 6 appears as shown in the
following frequency distribution. Can we conclude that the die is fair?

Outcome Frequency

1 3

2 6

3 2

4 3

5 9

6 7

Step 1 : State the null hypothesis and the alternative hypothesis

H0 : The outcomes are the same (each face is equally likely)

H1 : The outcomes are not the same

Step 2 : Select the level of significance.

We selected the 0.10 significance level.


Step 3 : Select the test statistic

The test statistic follows the Chi-Square distribution:

χ² = Σ [ (fo − fe)² / fe ]

with k − 1 degrees of freedom, where :

k is the number of categories

fo is an observed frequency in a particular category

fe is an expected frequency in a particular category

Step 4 : Formulate the decision rule

The number of degrees of freedom is k − 1. Because there are six categories,
there are k − 1 = 6 − 1 = 5 degrees of freedom. For 5 degrees of freedom and
the 0.10 significance level, the critical value is 9.236. That means:
reject H0 if χ² > 9.236

Step 5 : Compute the value of Chi-Square and make a decision

(Note : for equal expected frequencies, the expected frequency is the same for
each cell)

fe = Σfo / k = (3 + 6 + ⋯ + 7) / 6 = 30 / 6 = 5

Outcome   Frequency (fo)   fe   (fo − fe)²   (fo − fe)² / fe

1         3                5    4            0.8
2         6                5    1            0.2
3         2                5    9            1.8
4         3                5    4            0.8
5         9                5    16           3.2
6         7                5    4            0.8

χ² = Σ [ (fo − fe)² / fe ] = 0.8 + 0.2 + ⋯ + 0.8 = 7.60

Step 6 : Interpret The result

Because 7.60 is less than the critical value 9.236, do not reject H0. We cannot conclude that
the die is unfair.
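
The die example can be checked with a few lines of Python. This is only a sketch: the helper name chi_square_statistic is an assumption, and the critical value 9.236 is copied from the Chi-Square table rather than computed.

```python
def chi_square_statistic(observed, expected):
    """Sum of (fo - fe)^2 / fe over all categories."""
    return sum((fo - fe) ** 2 / fe for fo, fe in zip(observed, expected))

observed = [3, 6, 2, 3, 9, 7]                    # frequencies for faces 1..6
k = len(observed)
expected = [sum(observed) / k] * k               # equal expected frequencies: 30 / 6 = 5

chi_square = chi_square_statistic(observed, expected)
critical_value = 9.236                           # df = k - 1 = 5, alpha = 0.10
reject_h0 = chi_square > critical_value

print(f"chi-square = {chi_square:.2f}, reject H0: {reject_h0}")
```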

Note :

The Characteristics of the Chi-Square Distribution are :

- The Value of Chi-Square is never negative

- The Chi-Square Distribution is positively skewed

- There is a family of Chi-Square Distribution

o Each time the degrees of freedom change, a new distribution is formed

o As the degrees of freedom increase, the distribution approaches a normal distribution

2. Hypothesis Test of Unequal Expected Frequencies

From Experience, the bank credit department of Carolina Bank knows that 5% of its card
holders have had some high school, 15% have completed high school, 25% have had some
college, and 55% have completed college. Of the 500 card holders whose cards have been
called in for failure to pay their charges this month, 50 had some high school, 100 had
completed high school, 190 had some college, and 160 had completed college. Can we
conclude that the distribution of card holders who do not pay their charges is different from
all others?
Step 1 : State the null hypothesis and the alternative hypothesis

H0 : The proportions are as stated

H1 : The proportions are not as stated

Step 2 : Select the level of significance.

We selected the 0.01 significance level.

Step 3 : Select the test statistic

The test statistic follows the Chi-Square distribution

Step 4 : Formulate the decision rule

The number of degrees of freedom is k − 1. Because there are four categories,
there are k − 1 = 4 − 1 = 3 degrees of freedom. For 3 degrees of freedom and
the 0.01 significance level, the critical value is 11.345. That means:
reject H0 if χ² > 11.345

Step 5 : Compute the value of Chi-Square and make a decision

Data                    fo    fe                   (fo − fe)²   (fo − fe)² / fe

Some high school        50    (0.05)(500) = 25     625          25.00
Completed high school   100   (0.15)(500) = 75     625          8.33
Some college            190   (0.25)(500) = 125    4225         33.80
Completed college       160   (0.55)(500) = 275    13225        48.09

χ² = Σ [ (fo − fe)² / fe ] = 25.00 + 8.33 + 33.80 + 48.09 = 115.22
Step 6 : Interpret The result

Because 115.22 is greater than the critical value 11.345, reject H0. The proportions are not
as stated.
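
When the expected frequencies are unequal, they come from the stated proportions multiplied by the sample size. The sketch below follows the Carolina Bank numbers; the variable names are illustrative and the critical value 11.345 is taken from the table.

```python
proportions = [0.05, 0.15, 0.25, 0.55]   # stated shares of all card holders
observed = [50, 100, 190, 160]           # card holders who failed to pay, by education

n = sum(observed)                        # 500 card holders in the sample
expected = [p * n for p in proportions]  # fe = proportion x sample size

chi_square = sum((fo - fe) ** 2 / fe for fo, fe in zip(observed, expected))
critical_value = 11.345                  # df = 4 - 1 = 3, alpha = 0.01
reject_h0 = chi_square > critical_value

print(f"chi-square = {chi_square:.2f}, reject H0: {reject_h0}")
```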

3. Limitations of Chi-Square

If there is an unusually small expected frequency in a cell, chi-square (if applied) might result
in an erroneous conclusion. This can happen because fe appears in the denominator, and
dividing by a very small number makes the quotient quite large! Two generally accepted
policies regarding small cell frequencies are:
● If there are only two cells, the expected frequency in each cell should be at least 5.

● For more than two cells, chi-square should not be used if more than 20 percent of the cells
have expected frequencies less than 5.

4. Testing The Hypothesis That A Distribution Is Normal

A goodness-of-fit test can also be used to determine whether a sample of observations is from
a normal population.

First, calculate the mean and standard deviation of the sample data and group the data into a
frequency distribution. Convert the class limits to z values and find the standard normal
probability for each class. For each class, find the expected frequency by multiplying the
class probability by the total number of observations. Then calculate the Chi-Square
goodness-of-fit statistic from the observed and expected class frequencies. If we use the
sample mean and the sample standard deviation to estimate the population values, the
degrees of freedom are k - 3. But if we know the mean and the standard deviation of the
population, the degrees of freedom are k – 1.

Example :
The IRS is interested in the number of individual tax forms prepared by small accounting
firms. The IRS randomly sampled 50 public accounting firms with 10 or fewer employees in
the Dallas-Fort Worth area. The following frequency table reports the results of the study.
Assume the sample mean is 44.8 clients and the sample standard deviation is 9.37 clients. Is it
reasonable to conclude that the sample data are from a population that follows a normal
probability distribution? Use the 0.05 Significance level.

Number of Clients Frequency

20 up to 30 1

30 up to 40 15

40 up to 50 22

50 up to 60 8

60 up to 70 4

Step 1 : State the null hypothesis and the alternative hypothesis

H0 : The population of clients follows a normal distribution

H1 : The population of clients does not follow a normal distribution

Step 2 : Select the level of significance.

We selected the 0.05 significance level.

Step 3 : Select the test statistic

The test statistic follows the Chi-Square distribution

Step 4 : Formulate the decision rule

The number of degrees of freedom is k − 3, because we don’t know the mean
and the standard deviation of the population and estimate them from the sample.
There are k − 3 = 5 − 3 = 2 degrees of freedom. For 2 degrees of freedom and the
0.05 significance level, the critical value is 5.991. That means:
reject H0 if χ² > 5.991
Step 5 : Compute the value of Chi-Square and make a decision

To test for a normal distribution, we need to find the expected frequency for
each class in the distribution. Start by calculating the normal probability for
each class, converting the class limits to z values:

z = (x − sample mean) / s

To illustrate the computation, we select the class 20 up to 30 from the table.

For the upper limit of the 20 up to 30 class:

z = (x − sample mean) / s = (30 − 44.8) / 9.37 = −1.58

The probability of finding a z-value less than −1.58 is 0.5000 − 0.4429 = 0.0571.

For the upper limit of the 30 up to 40 class:

z = (x − sample mean) / s = (40 − 44.8) / 9.37 = −0.51

The area between z = −0.51 and the mean is 0.1950, so the probability for the
30 up to 40 class is 0.4429 − 0.1950 = 0.2479.

Number of Clients   z values            Area     Found by          fe

Under 30            Under −1.58         0.0571   0.5000 − 0.4429    2.855
30 up to 40         −1.58 up to −0.51   0.2479   0.4429 − 0.1950   12.395
40 up to 50         −0.51 up to 0.55    0.4038   0.1950 + 0.2088   20.19
50 up to 60         0.55 up to 1.62     0.2386   0.4474 − 0.2088   11.93
60 or more          1.62 or more        0.0526   0.5000 − 0.4474    2.63

Computation of the Chi-Square statistic: χ² = Σ [ (fo − fe)² / fe ]

Because the expected frequencies of the first and last classes are smaller than 5, these
classes are combined with the adjacent classes.

Number of Clients   Area     fo   fe      fo − fe   (fo − fe)²   (fo − fe)² / fe

Under 40            0.3050   16   15.25    0.75     0.5625       0.0369
40 up to 50         0.4038   22   20.19    1.81     3.2761       0.1623
50 or more          0.2912   12   14.56   -2.56     6.5536       0.4501
TOTAL               1.0000   50   50.00    0                     0.6493

Step 6 : Interpret The result

Do not Reject 𝐻0, These data could be from a normal distribution, because
0.6493 is not greater than 5.991.
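
The expected frequencies can also be computed directly from the normal distribution instead of a z-table. The sketch below uses the standard library's error function for the normal CDF and the three combined classes from the table above; because it does not round z to two decimals, its chi-square value differs slightly from 0.6493. The function and variable names are illustrative.

```python
import math

def normal_cdf(x, mean, sd):
    """P(X < x) for a normal distribution, via the error function."""
    return 0.5 * (1 + math.erf((x - mean) / (sd * math.sqrt(2))))

mean, sd, n = 44.8, 9.37, 50
boundaries = [40, 50]                 # class limits: under 40, 40 up to 50, 50 or more
observed = [16, 22, 12]

# Cumulative probabilities at the class limits, padded with 0 and 1 for the open ends.
cdf = [0.0] + [normal_cdf(b, mean, sd) for b in boundaries] + [1.0]
expected = [n * (upper - lower) for lower, upper in zip(cdf, cdf[1:])]

chi_square = sum((fo - fe) ** 2 / fe for fo, fe in zip(observed, expected))
reject_h0 = chi_square > 5.991        # critical value used in Step 4

print([round(fe, 2) for fe in expected], round(chi_square, 4), reject_h0)
```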

5. Contingency Table Analysis

A contingency table is used to test whether two traits or characteristics are related

The Degrees of Freedom are 𝒅𝒇 = (𝑹𝒐𝒘 − 𝟏)(𝑪𝒐𝒍𝒖𝒎𝒏 − 𝟏)

For Example :
The Director of Advertising for the Carolina Sun Times, the largest newspaper in the Carolinas, is
studying the relationship between the type of community in which a subscriber resides and the
section of the newspaper he or she reads first. For a sample of readers, she collected the sample
information in the following table.

Community   National News   Sports   Food
City        170             124      90
Suburb      120             112      100
Rural       130             90       88

At the 0.05 significance level, can we conclude there is a relationship between the type of
community where the person resides and the section of the paper read first?

Step 1 : State the null hypothesis and the alternative hypothesis

H0 : There is no relationship between community type and section read

H1 : There is a relationship between community type and section read

Step 2 : Select the level of significance.


We selected the 0.05 significance level.

Step 3 : Select the test statistic

The test statistic follows the Chi-Square distribution

Step 4 : Formulate the decision rule

The number of degrees of freedom is df = (Row − 1)(Column − 1). Because
there are three rows and three columns, there are df = (3 − 1)(3 − 1) = 2 × 2 = 4
degrees of freedom. For 4 degrees of freedom and the 0.05 significance level,
the critical value is 9.488. That means: reject H0 if χ² > 9.488

Step 5 : Compute the value of Chi-Square and make a decision

fe = (Row total)(Column total) / Grand total

For the City row:

fe = (384)(420) / 1024 = 157.5
fe = (384)(326) / 1024 = 122.25
fe = (384)(278) / 1024 = 104.25

            National News       Sports              Food                TOTAL

City        170                 124                 90                  384
            fe = 157.5          fe = 122.25         fe = 104.25

Suburb      120                 112                 100                 332
            fe = 136.171875     fe = 105.6953125    fe = 90.1328125

Rural       130                 90                  88                  308
            fe = 126.328125     fe = 98.0546875     fe = 83.6171875

TOTAL       420                 326                 278                 1024

χ² = Σ [ (fo − fe)² / fe ] = (170 − 157.5)² / 157.5 + ⋯ + (88 − 83.62)² / 83.62 = 7.340

Step 6 : Interpret The result

Because 7.340 is less than the critical value 9.488, do not reject H0. There is not enough
evidence of a relationship between community type and the section read first.
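
The whole contingency table calculation fits in a short sketch: row and column totals give the expected frequencies, and the same chi-square sum is applied to every cell. The variable names are assumptions; the critical value 9.488 is copied from the table.

```python
# Observed frequencies: rows are City, Suburb, Rural;
# columns are National News, Sports, Food.
observed = [
    [170, 124, 90],
    [120, 112, 100],
    [130, 90, 88],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# fe = (row total)(column total) / grand total for every cell.
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

chi_square = sum(
    (fo - fe) ** 2 / fe
    for obs_row, exp_row in zip(observed, expected)
    for fo, fe in zip(obs_row, exp_row)
)
df = (len(observed) - 1) * (len(observed[0]) - 1)
reject_h0 = chi_square > 9.488        # critical value for df = 4, alpha = 0.05

print(f"df = {df}, chi-square = {chi_square:.3f}, reject H0: {reject_h0}")
```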
REVIEW QUESTIONS

PROBLEM 11.1

In Korea, there are three dramas airing now. According to a report in this morning’s local
newspaper, a random sample of 100 viewers last night revealed: 82 people watched Hospital
Playlist, 76 people watched Itaewon Class, and 52 people watched Dream High. At the 0.02
significance level, is there a difference in the proportion of viewers watching the three dramas?

PROBLEM 11.2

A junior high school principal wanted to make a uniform t-shirt for all the students. The school
conducted a survey on the color that students in each grade want for the uniform
t-shirt. According to last year’s results, 46% of the students wanted black, 10% wanted
yellow, 14% wanted red, and 30% wanted white. Listed below is a breakdown of a sample of 550
responses randomly selected from all the responses of students in the school this month. At the
0.1 significance level, does the distribution of last year’s responses reflect the students this
year?

Color Frequency

Black 241

Yellow 92

Red 94

White 123

TOTAL 550

PROBLEM 11.3

EXO Manufacturing Company believes that its hourly wages follow a normal probability
distribution. To confirm this, 100 employees were sampled and the results were organized into the
following frequency distribution. The sample mean is 7.321 and the sample standard deviation is
3. At the 0.02 significance level, is it reasonable to conclude that the distribution of hourly wages
follows a normal distribution?

Hourly Wage ( $) Frequency

1.00 up to 1.50 17

2.00 up to 2.50 21

3.00 up to 3.50 12

4.00 up to 4.50 39

5.00 up to 5.50 11

TOTAL 100

PROBLEM 11.4

A survey investigated the public’s opinion toward Boba price. Each sampled citizen was
classified as to whether he or she felt the Boba seller should reduce the price or increase the price
, or if the individual had no opinion. The sample results of the study by gender are reported
below.

Gender Reduce The Price Increase The Price No Opinion

Female 84 61 99

Male 42 52 21

At the 0.05 significance level, is it reasonable to conclude that gender is independent of a
person’s opinion on the Boba price?
MODULE 12
NONPARAMETRIC METHODS: ANALYSIS OF ORDINAL DATA
By Stacia Andany

In this module, you will learn about the hypothesis tests used for nonparametric (ordinal) data,
that is, data analyzed through a ranking or ordering of the observations rather than through
numerical values assumed to follow a normal distribution. There are 4 testing methods
that you’ll be required to know for this module:
- The Sign Test.
- Wilcoxon Signed-Rank Test.
- Kruskal-Wallis Test.
- Rank-order Correlation Test.
This module will discuss all of them, starting from The Sign Test.

The Sign Test


- Based on the sign (+/-) assigned to the difference between two related observations from two
dependent populations.
- The plus sign (+) is assigned to a positive difference (increase) between two observations
in a member of the population.
- The minus sign (-) is assigned to a negative difference (decrease) between two
observations in a member of the population.
- If there is no difference between the two observations, then the sign is zero and the
observation is excluded from the test.
- The sign test isn’t concerned with the size of the difference (the number), only the
direction of the difference (+ or -).
- Follows the binomial distribution for critical values.
- Can be either two-tailed or one-tailed.
Example of application using the 6-step hypothesis-testing procedure
Sample Problem
A sample of 15 managers is randomly selected from a company. Each manager is rated on his or
her computer knowledge by a panel of experts. They were rated as either “outstanding”,
“excellent”, “good”, “fair” or “poor” with “outstanding” being the best and “poor” being the
worst.

The managers were then given a 3-month training program and were rated by the same panel of
experts again. The following table compares the managers’ old and new knowledge:
Name Before After Sign of Difference
Tyler Good Outstanding +
Sue Fair Excellent +
James Excellent Good -
Jackson Poor Good +
Andy Excellent Excellent 0*
Sarah Good Outstanding +
Antonia Poor Fair +
Jean Excellent Outstanding +
Coy Good Poor -
Troy Poor Good +
Virginia Good Outstanding +
Juan Fair Excellent +
Candy Good Fair -
Arthur Good Outstanding +
Sandy Poor Good +
* Andy’s knowledge neither improved nor declined after the training, so his sign of difference is
zero and therefore has been excluded from the test.
Solution
Step 1: State the Null and Alternative Hypothesis
H0: π ≤ 0.50
There has been no increase in the computer knowledge of the managers after the
computer-training program.
H1: π > 0.50
There has been an increase in the computer knowledge of the managers after the
computer-training program.
- The π symbol refers to the proportion of the population that has a specific characteristic
(here, a plus sign). This test is one-tailed.
- In this case, there are only two outcomes, “success” and “failure”, and under the null
hypothesis the probability of each outcome is 0.50 in every observation.
- The number of trials in this observation is fixed (n=15) and each trial is independent of
the others.
Step 2: Select a level of significance
The level of significance for this test is 0.10
Step 3: Decide on the test statistic.
We’re using the number of plus signs that resulted from the observation of the change in the
manager’s knowledge level.
Step 4: Formulate a decision rule.
- Fifteen managers entered the training, but Andy showed no change in his
knowledge level, so his sign of difference is 0 and he is excluded from the test. So,
n=14.
- From the binomial probability distribution table for n=14 and π=0.50, we know that:
No. of Successes   Probability   Cumulative Probability

0 0.000 1.000

1 0.001 0.999

2 0.006 0.998

3 0.022 0.992

4 0.061 0.970

5 0.122 0.909

6 0.183 0.787

7 0.209 0.604

8 0.183 0.395

9 0.122 0.212

10 0.061 0.090

11 0.022 0.029

12 0.006 0.007
13 0.001 0.001

14 0.000 0.000

- The cumulative probabilities are added from the bottom up, so each one is the probability of
that number of successes or more.

Since we have a significance level of 0.10, we have to find the cumulative probability that
is the closest to 0.10 but does not exceed it. That cumulative probability is 0.090, with the
corresponding number of successes of 10. From this we can say that if the number of plus signs
(successes) is 10 or more, the null hypothesis is rejected and the alternate hypothesis is accepted.

Step 5: Make a decision regarding the null hypothesis


From the observation result, 11 out of 14 managers have (+) as their sign of difference. The
number 11 is in the rejection region, which starts at 10, so the null hypothesis is rejected.
Step 6: Interpret the Result
Since the null hypothesis is rejected, we can conclude that the computer training program was
effective as it increased the computer knowledge of the managers.
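
The sign test above can be sketched with exact binomial probabilities from the standard library. The list of signs is transcribed from the table of managers; the helper name upper_tail is an assumption for this example.

```python
from math import comb

def upper_tail(n, x, p=0.5):
    """P(X >= x) for a binomial distribution with n trials and success probability p."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(x, n + 1))

# Signs of difference in table order (Tyler ... Sandy); "0" marks no change.
signs = ["+", "+", "-", "+", "0", "+", "+", "+", "-", "+", "+", "+", "-", "+", "+"]

n = sum(1 for s in signs if s != "0")     # zeros are excluded from the test
plus_count = signs.count("+")
alpha = 0.10

# Smallest number of plus signs whose upper-tail probability does not exceed alpha.
critical = next(c for c in range(n + 1) if upper_tail(n, c) <= alpha)
p_value = upper_tail(n, plus_count)
reject_h0 = plus_count >= critical

print(f"n = {n}, plus signs = {plus_count}, critical = {critical}, "
      f"p-value = {p_value:.4f}, reject H0: {reject_h0}")
```

The exact upper-tail probabilities differ from the table in the third decimal for some rows, because the table adds probabilities that were already rounded.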

Wilcoxon Signed-Rank Test


- Hypothesis test on dependent populations.
- The test uses the Wilcoxon T values as its critical values.
- One-tailed test.
- Intended for data that had a subjective collection process.

Example of application using the 6-step hypothesis testing procedure.


Sample Problem
A fried chicken fast food restaurant takes a random sample of 15 customers. Each customer is
given a piece of fried chicken with the restaurant’s current flavoring and asked to rate it from 1 to
20. A score of 20 means that the customer really likes the flavor and a low score means that the
customer dislikes the flavor. The same 15 participants are then given a piece of chicken with the
restaurant’s new flavor and also asked to rate it from 1 to 20. From the results reported below,
can we say that the new flavor is preferred over the current flavor? Use a 0.05 significance level.
Customer Name Score for New Flavor Score for Current Flavor
Arquette 14 12
Jones 8 16
Fish 6 2
Wagner 18 4
Badenhop 20 12
Hall 16 16
Fowler 14 5
Virost 6 16
Garcia 19 10
Sundar 18 10
Miller 16 13
Peterson 18 2
Boggart 4 13
Hein 7 14
Whitten 16 4

Solution
Step 1: State the Null and Alternate Hypothesis
H0: There is no difference in the ratings of the two flavors.
H1: The ratings for the new flavor are higher.
Since it’s either no difference or the new flavor having higher ratings, the test is one-tailed.
Step 2: Identify the Level of significance (0.05)

Step 3: Determine the Test Statistic
We’re using the ratings given by the sampled customers of the restaurant to determine whether the
new flavor is more favorable or not.
Step 4: Conduct the Wilcoxon Signed-Rank Test using T values
1. The first step in conducting this test is to find the difference between the ratings of the
two flavors and its absolute value. The absolute differences are ranked in ascending order,
from the smallest difference to the largest (tied differences share the average of their
ranks). After that, each rank is given the sign of the actual difference, as demonstrated
below:
Participant   Score New   Score Current   Difference   Absolute     Rank   Signed    Signed
              Flavor      Flavor          in Scores    Difference          Rank R+   Rank R−
Arquette      14          12                2           2            1      1
Jones          8          16               -8           8            6                6
Fish           6           2                4           4            3      3
Wagner        18           4               14          14           13     13
Badenhop      20          12                8           8            6      6
Hall          16          16                0           *            *
Fowler        14           5                9           9            9      9
Virost         6          16              -10          10           11               11
Garcia        19          10                9           9            9      9
Sundar        18          10                8           8            6      6
Miller        16          13                3           3            2      2
Peterson      18           2               16          16           14     14
Boggart        4          13               -9           9            9                9
Hein           7          14               -7           7            4                4
Whitten       16           4               12          12           12     12
Sums                                                                       75        30
*Hall’s scores for the two flavors show no difference, so Hall has been excluded from
the ranking and the rank sums.
For the Wilcoxon Signed-Rank test, only the smaller of the two signed-rank sums is used in
comparison with the critical value from the Wilcoxon T table to determine whether the null
hypothesis should be rejected or not.
2. After obtaining the signed-rank sums, go to the Wilcoxon T-value table to find the
critical value for this problem, using the number of usable pairs (n=14, since Hall is
excluded for having a zero difference) and the significance level provided (α=0.05,
one-tailed). At the intersection of n=14 and α=0.05, the critical value is 25. The null
hypothesis should therefore be rejected if the smaller signed-rank sum is 25 or less.
Step 5: Make a Decision Regarding the Null Hypothesis
Since the smaller signed-rank sum is 30, which is larger than the critical value of 25, the
decision is to not reject the null hypothesis.
Step 6: Interpret the Result
Because the null hypothesis is not rejected, there is not enough evidence at the 0.05
significance level to conclude that the new chicken flavor is rated higher than the current
chicken flavor.
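
The ranking and summing in Step 4 can be checked with a short script. The following is a minimal Python sketch and not part of the original lab procedure; the helper name average_ranks is illustrative, and the critical value of 25 is simply the table value quoted in the solution.

```python
def average_ranks(values):
    """1-based ascending ranks; tied values share the mean of their positions."""
    ordered = sorted(values)
    return [ordered.index(v) + (ordered.count(v) + 1) / 2 for v in values]

# Scores from the sample problem, in the order the customers are listed.
new = [14, 8, 6, 18, 20, 16, 14, 6, 19, 18, 16, 18, 4, 7, 16]
current = [12, 16, 2, 4, 12, 16, 5, 16, 10, 10, 13, 2, 13, 14, 4]

# Zero differences (Hall) are dropped before ranking.
diffs = [a - b for a, b in zip(new, current) if a != b]
ranks = average_ranks([abs(d) for d in diffs])

r_plus = sum(r for d, r in zip(diffs, ranks) if d > 0)
r_minus = sum(r for d, r in zip(diffs, ranks) if d < 0)

t_stat = min(r_plus, r_minus)   # smaller signed-rank sum
critical = 25                   # Wilcoxon T table, n=14, one-tailed alpha=0.05
reject_h0 = t_stat <= critical

print(len(diffs), r_plus, r_minus, reject_h0)
```

The script reproduces the sums in the table (R+ = 75, R- = 30) and the decision not to reject the null hypothesis.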

Wilcoxon Rank-Sum Test For Independent Populations

Where :
n1 is the number of observations from the first population.
n2 is the number of observations from the second population.
U is the sum of the ranks from the first population
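
As a sketch, assuming the usual large-sample (normal approximation) form of the rank-sum test for two independent samples, the test statistic can be written with the symbols defined above, where U is the rank sum of the first sample. This form is an assumption taken from standard textbooks rather than from this module, so check it against the course text before use:

```latex
z = \frac{U - \dfrac{n_1 (n_1 + n_2 + 1)}{2}}{\sqrt{\dfrac{n_1 n_2 (n_1 + n_2 + 1)}{12}}}
```

The numerator compares the observed rank sum with the value expected when the two populations are the same, and the denominator is the standard error of that rank sum.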

Kruskal-Wallis Test
This test is an alternative to the one-way ANOVA (analysis of variance). It should be used when:
- The data does not follow a normal distribution.
- The population standard deviations or variances are unequal.
- Samples selected from the population are independent.
The test uses the Chi-Square table for its critical values.
Formula for Kruskal-Wallis Test
$$H = \frac{12}{n(n+1)}\left[\frac{(\Sigma R_1)^2}{n_1} + \frac{(\Sigma R_2)^2}{n_2} + \cdots + \frac{(\Sigma R_k)^2}{n_k}\right] - 3(n+1)$$
With k-1 degrees of freedom (k is the number of populations), where:
1. ΣR1, ΣR2, ..., ΣRk are the sums of the ranks of samples 1, 2, ..., k respectively.
2. n1, n2, ..., nk are the sizes of samples 1, 2, ..., k respectively, and n is the combined
number of observations for all samples.
The following example applies the test using the 6-step hypothesis testing procedure.
Sample Problem
The director of a Hospital Systems company is concerned about the emergency treatment waiting
times for patients at the three hospitals it operates around the city. To find out, the director
selected random samples of patients at the three locations and collected the following data:

St. Luke’s Memorial    Swedish Medical Center    Piedmont Hospital
56                     103                       42
39                     87                        38
48                     51                        89
38                     95                        75
73                     68                        35
60                     42                        61
62                     107
                       89
Using a 0.05 significance level, determine whether there is a difference in the waiting times
at the three hospitals.

Solution
Step 1: State the Null and Alternate Hypothesis
H0: The Population distributions of waiting times are the same for the three hospitals.
H1: The Population distributions are not all the same for the three hospitals.
Step 2: Determine the level of significance and degrees of freedom
As given in the problem, the level of significance is 0.05. The degrees of freedom are k-1, where k is
the number of populations in the test, which is 3. So, the degrees of freedom are 3-1 = 2. The combined
number of observations is 21 (n=21).
Step 3: Determine the critical value for the test
For Df = 2 and a significance level of 0.05, the critical value from the Chi-Square table is 5.991.
The null hypothesis is not rejected if the test value is less than or equal to 5.991, and is
rejected if the test value is greater than 5.991.
Step 4: Conduct the Kruskal-Wallis Test
1. Combine the waiting times from all three hospitals and rank them from the shortest to the longest:
St. Luke’s      Swedish         Piedmont
Time  Rank      Time  Rank      Time  Rank
56    9         103   20        42    5.5*
39    4         87    16        38    2.5*
48    7         51    8         89    17.5*
38    2.5*      95    19        75    15
73    14        68    13        35    1
60    10        42    5.5*      61    11
62    12        107   21
                89    17.5*
ΣR1=58.5        ΣR2=120         ΣR3=52.5

*Average rank for tied values. For example, the data contain two 38s right after 35
(rank 1). They would occupy ranks 2 and 3, but because they are the same value, each
receives the average rank (2+3)/2 = 2.5.
● ΣR1, ΣR2, ΣR3 are the sums of the ranks for each sample, to be used in the test.

2. Substitute the rank sums and sample sizes into the Kruskal-Wallis formula to conduct the test.
$$H = \frac{12}{21(21+1)}\left[\frac{(58.5)^2}{7} + \frac{(120)^2}{8} + \frac{(52.5)^2}{6}\right] - 3(21+1) = 5.384$$

The test yields a value of H = 5.38, which will be compared with the critical value found
from the chi-square table to determine whether the null hypothesis should be rejected or not.
Step 5: Make a decision regarding the null hypothesis.
The test yields a value of 5.38, which is less than the critical value of 5.991, so the decision
should be to fail to reject the null hypothesis.
Step 6: Interpretation of the result.
Since the null hypothesis is not rejected, there is not enough evidence at the 0.05 significance
level to conclude that the waiting times at the three hospitals differ.
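
The pooled ranking and the H statistic can be verified with a short script. This is a minimal Python sketch under the same setup as the worked solution (no tie correction, critical value 5.991 read from the Chi-Square table); the helper name average_ranks and the variable names are illustrative.

```python
def average_ranks(values):
    """1-based ascending ranks; tied values share the mean of their positions."""
    ordered = sorted(values)
    return [ordered.index(v) + (ordered.count(v) + 1) / 2 for v in values]

# Waiting times from the sample problem.
groups = {
    "St. Luke's": [56, 39, 48, 38, 73, 60, 62],
    "Swedish": [103, 87, 51, 95, 68, 42, 107, 89],
    "Piedmont": [42, 38, 89, 75, 35, 61],
}

# Rank all 21 observations together, then split the ranks back by hospital.
pooled = [x for times in groups.values() for x in times]
ranks = average_ranks(pooled)
n = len(pooled)

rank_sums = {}
start = 0
for name, times in groups.items():
    rank_sums[name] = sum(ranks[start:start + len(times)])
    start += len(times)

h = 12 / (n * (n + 1)) * sum(
    rank_sums[name] ** 2 / len(times) for name, times in groups.items()
) - 3 * (n + 1)

critical = 5.991            # Chi-Square table, df = 2, alpha = 0.05
reject_h0 = h > critical

print(rank_sums, round(h, 3), reject_h0)
```

The rank sums match the table (58.5, 120 and 52.5), and H stays below the critical value, so the null hypothesis is not rejected.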

Rank-Order Correlation Test


Used as an alternative to the coefficient of correlation test of hypothesis. There are several
conditions where the coefficient of correlation is not appropriate and the rank-order
correlation is used instead:
1. When at least one of the two variables is measured on a ranked (ordinal) scale.
2. When the relationship between the two variables is non-linear.
3. When one or more data points differ greatly from the majority (outliers).
This test uses Spearman’s Coefficient of Rank Correlation formula:
$$r_s = 1 - \frac{6\Sigma d^2}{n(n^2-1)}$$
● d is the difference between the ranks of each pair
● n is the number of paired observations (number of subjects involved)
The result is interpreted in the same way as the coefficient of correlation (please refer back to
module 8 for an example of application) to determine whether the two variables have a strong
positive, weak positive, weak negative or strong negative correlation.

Sample Problem

Recent studies focus on the relationship between the age of online shoppers and the
number of minutes spent browsing on the internet. The table shows a sample of 15 online shoppers
who actually made a purchase last week. Included are their ages and the time, in minutes, spent
browsing on the internet last week.
SHOPPERS AGE BROWSING TIME (MINUTES)
SPINA 28 342
GORDON 50 125
SCHNUR 44 121
ALVEAR 32 257
MYERS 55 56
LYONS 60 225
HARBIN 38 185
BOBKO 22 141
KOPPEL 21 342
ROWATTI 45 169
MONAHAN 52 218
LANOUE 33 241
ROLL 19 583
GOODALL 17 394
BRODERICK 21 249

Solution
SHOPPERS    AGE  AGE RANK  BROWSING TIME (MINUTES)  BROWSING RANK  D       D²
SPINA       28   6.0       342                      12.5           -6.50   42.25
GORDON      50   12.0      125                      3.0            9.00    81.00
SCHNUR      44   10.0      121                      2.0            8.00    64.00
ALVEAR      32   7.0       257                      11.0           -4.00   16.00
MYERS       55   14.0      56                       1.0            13.00   169.00
LYONS       60   15.0      225                      8.0            7.00    49.00
HARBIN      38   9.0       185                      6.0            3.00    9.00
BOBKO       22   5.0       141                      4.0            1.00    1.00
KOPPEL      21   3.5       342                      12.5           -9.00   81.00
ROWATTI     45   11.0      169                      5.0            6.00    36.00
MONAHAN     52   13.0      218                      7.0            6.00    36.00
LANOUE      33   8.0       241                      9.0            -1.00   1.00
ROLL        19   2.0       583                      15.0           -13.00  169.00
GOODALL     17   1.0       394                      14.0           -13.00  169.00
BRODERICK   21   3.5       249                      10.0           -6.50   42.25
Sum d²                                                                     965.50
The coefficient of rank correlation is -0.724, found by using the formula:
$$r_s = 1 - \frac{6\Sigma d^2}{n(n^2-1)} = 1 - \frac{6(965.5)}{15(15^2-1)} = 1 - 1.724 = -0.724$$
The value of -0.724 indicates a fairly strong negative association between age of the internet
shopper and the minutes spent browsing.
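
The rank table and the coefficient can be reproduced with a few lines of code. This is a minimal Python sketch of the same hand calculation (rank each variable, take the rank differences, apply the formula); the helper name average_ranks and the variable names are illustrative.

```python
def average_ranks(values):
    """1-based ascending ranks; tied values share the mean of their positions."""
    ordered = sorted(values)
    return [ordered.index(v) + (ordered.count(v) + 1) / 2 for v in values]

# Data from the sample problem, in the order the shoppers are listed.
age = [28, 50, 44, 32, 55, 60, 38, 22, 21, 45, 52, 33, 19, 17, 21]
minutes = [342, 125, 121, 257, 56, 225, 185, 141, 342, 169, 218, 241, 583, 394, 249]

age_rank = average_ranks(age)
time_rank = average_ranks(minutes)

# d is the age rank minus the browsing rank for each shopper.
d = [a - b for a, b in zip(age_rank, time_rank)]
sum_d2 = sum(x ** 2 for x in d)

n = len(age)
rs = 1 - 6 * sum_d2 / (n * (n ** 2 - 1))

print(sum_d2, round(rs, 3))
```

The sum of squared differences comes out to 965.5 and the coefficient rounds to -0.724, matching the table.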

The only additional component to this test is testing the significance of rs, using the t
distribution with n-2 degrees of freedom for its critical value. The test statistic for the
hypothesis test of rank correlation is:
$$t = r_s\sqrt{\frac{n-2}{1-r_s^2}}$$

Example:
Using a significance level of 0.05, an rs of -0.724 and n of 15, conduct the hypothesis test of rank
correlation!
Solution:
H0: The rank correlation in the population is zero
H1: There is a negative correlation amongst the variables in the population.
Df = 15-2 = 13
Using the Df and the significance level, we can find the critical value for this test in the t
distribution table for a one-tailed test, which yields a value of -1.771.
$$t = (-0.724)\sqrt{\frac{15-2}{1-(-0.724)^2}} = -3.784$$
The t-test yields a result of -3.784, which is less than the critical value of -1.771, so the
conclusion would be to reject the null hypothesis, meaning that there is a negative correlation
amongst the variables in the population.
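
The arithmetic of this significance test can be sketched in Python. The values of rs, n and the critical value are taken from the example; the variable names are illustrative.

```python
import math

rs = -0.724        # coefficient of rank correlation from the example
n = 15             # number of paired observations
critical = -1.771  # t table, df = 13, one-tailed alpha = 0.05

t = rs * math.sqrt((n - 2) / (1 - rs ** 2))

# Left-tailed test: reject H0 when t falls below the negative critical value.
reject_h0 = t < critical

print(round(t, 3), reject_h0)
```

Because the alternate hypothesis is a negative correlation, the rejection region is in the left tail, which is why the comparison uses "less than".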
REVIEW QUESTIONS
PROBLEM 12.1
Many new stockbrokers resist giving presentations to bankers and certain other groups.
Sensing this lack of self-confidence, management arranged to have a confidence-building
seminar for a sample of new stockbrokers and enlisted Career Boosters for a three-week course.
Before the first session, Career Boosters measured the level of confidence of each participant. It
was measured again after the three-week seminar. The before and after levels of self-confidence
for the 14 in the course are shown below. Self-confidence was classified as being either
negative, low, high, or very high.

STOCKBROKER  BEFORE SEMINAR  AFTER SEMINAR  SIGN OF DIFFERENCE
FORD NEGATIVE LOW +
WALKER LOW HIGH +
DINGH LOW HIGH +
JONES JR VERY HIGH LOW -
HAMMER LOW HIGH +
SKEEN NEGATIVE NEGATIVE 0
SIMMER NEGATIVE HIGH +
ORPHEY LOW VERY HIGH +
MARTIN LOW HIGH +
ARTHUR NEGATIVE LOW +
MURPHY LOW HIGH +
PIERRE NEGATIVE LOW +
LOPEZ LOW HIGH +
JAGGER LOW VERY HIGH +

The purpose of this study is to find whether Career Boosters was effective in raising the
self-confidence of the new stockbrokers. That is, was the level of self-confidence higher after
the seminar than before it? Use the .05 significance level.
PROBLEM 12.2
The assembly area of Matthew Product was recently redesigned. Installing a new lighting
system and purchasing a new workbench were two features of the redesign. The production
supervisor would like to know if the changes resulted in improved worker productivity. To
investigate, she selected a sample of 11 workers and determined the production rate before and
after the changes. The sample information is reported below.

Operator  Production Before  Production After
L. M.     22                 10
A. B.     25                 15
S. Z.     20                 17
B. B.     22                 25
M. F.     30                 24
S. S.     16                 16
C. L.     23                 21
H. R.     26                 23
M. N.     18                 17
S. N.     19                 20
E. L.     28                 10

(a) Use the Wilcoxon signed-rank test to determine whether the new procedures actually
increased production. Use the .05 level and a one-tailed test.
(b) What assumption are you making about the distribution of the differences in production
before and after redesign?

PROBLEM 12.3
The regional bank manager of Capital Financial Bank is interested in the number of
transactions occurring in personal checking accounts at four of the bank’s branches. Each branch
randomly samples a number of personal checking accounts and records the number of
transactions made in each account over the last six months. The results are in the table below.
Using the .01 level and the Kruskal-Wallis Test, determine whether there is a difference in the
number of personal checking account transactions among the four branches.
EASTERN BRANCH  WEST SIDE BRANCH  NORTHERN BRANCH  SOUTH SIDE BRANCH
340 100 296 80

180 99 91 86

319 189 307 91

302 116 142 62

103 131 208 91

103 199

PROBLEM 12.4
A sample of individuals applying for manufacturing jobs at Kevin Enterprises revealed
the following scores on an eye perception test (X) and a mechanical aptitude test (Y):

SUBJECT  EYE PERCEPTION  MECHANICAL APTITUDE

01 682 40

02 840 42

03 777 62

04 805 30

05 810 28
06 777 55

07 820 51

08 777 70

09 820 60

10 805 23

(a) Compute the coefficient of rank correlation between eye perception and mechanical aptitude.

(b) At the .05 significance level, can we conclude that the correlation in the population is
different from 0?
Appendix 1 : Z TABLE
Appendix 2 : t TABLE
Appendix 3 : F TABLE (0.5)
Appendix 4 : F TABLE (0.1)
Appendix 5 : CHI SQUARE TABLE
Appendix 6 : WILCOXON T VALUE
