
Lecture Statistics

One-Sample Non-Parametric Tests

Chi-Square Test
6 Steps in Hypothesis Testing

1. State the null hypothesis, H0, and the alternative hypothesis, H1
2. Choose the level of significance, α, and the sample size, n
3. Determine the appropriate test statistic and sampling distribution
4. Determine the critical values that divide the rejection and nonrejection regions
6 Steps in Hypothesis Testing (continued)

5. Collect data and compute the value of the test statistic
6. Make the statistical decision and state the managerial conclusion. If the test statistic falls into the nonrejection region, do not reject the null hypothesis H0. If the test statistic falls into the rejection region, reject the null hypothesis. Express the managerial conclusion in the context of the problem.
One-Sample Hypothesis Testing

• Nominal Data: Chi-Square Test (non-parametric test)

• Ordinal Data: Kolmogorov-Smirnov (K-S) Test (non-parametric test)

• Interval/Ratio Data: Z-test or t-test (parametric test)

Chi-Square Test for Nominal (Categorical) Data

A chi-square statistic is used to investigate whether distributions differ from one another.

• One variable: how its distribution compares to a second, given distribution (a test of goodness of fit)

• Two variables: whether the two variables are statistically related to each other (a test for independence)
Chi-Square Statistic

χ² = Σ (observed frequency − expected frequency)² / expected frequency

Note that chi-square tests can only be used on actual counts, not on percentages, proportions, etc.

For reasonably large n, the above statistic under H0 has an approximate chi-squared distribution with (k − 1) degrees of freedom, where k is the number of categories.
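
The slides do not include software, but as a minimal sketch (assuming Python with numpy and scipy available), the statistic can be computed directly from the formula above; the counts here are hypothetical, chosen only to illustrate the computation:

```python
# Minimal sketch (not from the slides): chi-square goodness-of-fit statistic
# computed directly from the formula above, on hypothetical counts.
import numpy as np
from scipy.stats import chi2

observed = np.array([30, 14, 34, 45, 27])   # hypothetical counts, k = 5 categories
expected = np.full(len(observed), observed.sum() / len(observed))  # H0: all equal

chi_sq = ((observed - expected) ** 2 / expected).sum()
df = len(observed) - 1                       # k - 1 degrees of freedom
p_value = chi2.sf(chi_sq, df)                # area to the right of the statistic

print(f"chi-square = {chi_sq:.2f}, df = {df}, p-value = {p_value:.3f}")
```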
Chi-Square (cont.)

• Here's the distribution (figure): it is actually a family of distributions depending on the degrees of freedom.

• If there is in fact no relationship between two variables, and you draw repeated samples and calculate the formula on the last slide, you'll get this kind of distribution simply because of chance variation.

• What we do in practice is to draw one sample and make the calculation. If the number we get is large enough, it tells us that we almost certainly didn't get this number by chance, i.e., there really is a relationship between these variables in the population.
Chi-Square Statistical Table

[Figure: chi-square critical values for chi-square distributions; the tabulated area lies to the right of the critical value.]
Example 1: Car Accidents and Day of the Week

A study of 667 drivers who were using a cell phone when they were involved in a collision on a weekday examined the relationship between these accidents and the day of the week.

Are the accidents equally likely to occur on any day of the working week? To answer this question we use the chi-square goodness-of-fit test.

Data for n observations on a categorical variable (for example, day of the week) with k possible outcomes (k = 5 weekdays) are summarized as observed counts n1, n2, . . . , nk in k cells.
The Chi-Square Statistic

χ² = Σ (observed frequency − expected frequency)² / expected frequency

where the expected frequencies are those implied if H0 is true, and the statistic follows a chi-square distribution with k − 1 degrees of freedom, where k is the number of categories.
Decision Rule

The χ²_STAT test statistic approximately follows a chi-squared distribution with (k − 1) degrees of freedom.

Decision Rule: If χ²_STAT > χ²_α, reject H0; otherwise, do not reject H0.

[Figure: do-not-reject region to the left of the critical value χ²_α; rejection region of area α in the right tail.]


The Chi-Square Test Statistic

• The χ² test statistic approximately follows a chi-squared distribution with k − 1 degrees of freedom, where k is the number of categories.
• If the χ² test statistic is large, this is evidence against the null hypothesis.

χ² = Σ over all cells of (Obs − Exp)² / Exp

Decision Rule (α = .05): If χ² > χ²_.05, reject H0; otherwise, do not reject H0.

[Figure: do-not-reject region to the left of χ²_.05; rejection region of area .05 in the right tail.]
Car Accidents and Day of the Week (compare χ² to table value)

H0 specifies that all days are equally likely for car accidents ➔ each pi = 1/5.

The expected count for each of the five days is n·pi = 667(1/5) = 133.4.

χ² = Σ (observed − expected)² / expected = Σ over days of (count − 133.4)² / 133.4 = 8.49

following the chi-square distribution with 5 − 1 = 4 degrees of freedom.

Chi-square critical values (tail probability p = area to the right):

df   0.25   0.20   0.15   0.10   0.05   0.025  0.02   0.01   0.005  0.0025 0.001  0.0005
 1   1.32   1.64   2.07   2.71   3.84   5.02   5.41   6.63   7.88   9.14  10.83  12.12
 2   2.77   3.22   3.79   4.61   5.99   7.38   7.82   9.21  10.60  11.98  13.82  15.20
 3   4.11   4.64   5.32   6.25   7.81   9.35   9.84  11.34  12.84  14.32  16.27  17.73
 4   5.39   5.99   6.74   7.78   9.49  11.14  11.67  13.28  14.86  16.42  18.47  20.00
 5   6.63   7.29   8.12   9.24  11.07  12.83  13.39  15.09  16.75  18.39  20.51  22.11
 6   7.84   8.56   9.45  10.64  12.59  14.45  15.03  16.81  18.55  20.25  22.46  24.10
 7   9.04   9.80  10.75  12.02  14.07  16.01  16.62  18.48  20.28  22.04  24.32  26.02
 8  10.22  11.03  12.03  13.36  15.51  17.53  18.17  20.09  21.95  23.77  26.12  27.87
 9  11.39  12.24  13.29  14.68  16.92  19.02  19.68  21.67  23.59  25.46  27.88  29.67
10  12.55  13.44  14.53  15.99  18.31  20.48  21.16  23.21  25.19  27.11  29.59  31.42
11  13.70  14.63  15.77  17.28  19.68  21.92  22.62  24.72  26.76  28.73  31.26  33.14
12  14.85  15.81  16.99  18.55  21.03  23.34  24.05  26.22  28.30  30.32  32.91  34.82
13  15.98  16.98  18.20  19.81  22.36  24.74  25.47  27.69  29.82  31.88  34.53  36.48

Since the value 8.49 of the test statistic is less than the table value of 9.49 (df = 4, p = 0.05), we do not reject H0.

➔ There is no significant evidence of different car accident rates for different weekdays when the driver was using a cell phone.
Car Accidents and Day of the Week (bounds on the P-value)

H0 specifies that all days are equally likely for car accidents ➔ each pi = 1/5.

The expected count for each of the five days is n·pi = 667(1/5) = 133.4.

χ² = Σ (observed − expected)² / expected = Σ over days of (count − 133.4)² / 133.4 = 8.49

following the chi-square distribution with 5 − 1 = 4 degrees of freedom.

From the df = 4 row of the chi-square table above: 7.78 (p = 0.1) < χ² = 8.49 < 9.49 (p = 0.05). Thus the bounds on the P-value are 0.05 < P-value < 0.1.

We don't know the exact P-value, but we DO know that P-value > 0.05, thus we conclude that

➔ There is no significant evidence of different car accident rates for different weekdays when the driver was using a cell phone.
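
For comparison (not on the slide), the exact P-value behind these bounds can be computed; this sketch assumes scipy:

```python
# Sketch (not on the slide): the exact P-value for the observed statistic.
from scipy.stats import chi2

p_value = chi2.sf(8.49, df=4)  # area to the right of the observed statistic
print(p_value)                 # about 0.075, consistent with 0.05 < P < 0.1
```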
Example 2: M&M Colors

❑ Mars, Inc. periodically changes the M&M (milk chocolate) color proportions. Last year the proportions were: yellow 20%; red 20%; orange, blue, green 10% each; brown 30%.

• In a recent bag of 106 M&M's I had the following numbers of each color:

Yellow       Red          Orange       Blue         Green        Brown
29 (27.4%)   23 (21.7%)   12 (11.3%)   14 (13.2%)   8 (7.5%)     20 (18.9%)

• Is this evidence that Mars, Inc. has changed the color distribution of M&M's?
Example 2: M&M Colors

• H0: p_yellow = .20, p_red = .20, p_orange = .10, p_blue = .10, p_green = .10, p_brown = .30

       Yellow   Red     Orange   Blue    Green   Brown   Total
Obs.   29       23      12       14      8       20      106
Exp.   21.2     21.2    10.6     10.6    10.6    31.8    106

• Expected yellow = 106 × .20 = 21.2, etc. for the other expected counts.

χ² = Σ over all cells of (Obs − Exp)² / Exp
   = (29 − 21.2)²/21.2 + (23 − 21.2)²/21.2 + (12 − 10.6)²/10.6 + (14 − 10.6)²/10.6 + (8 − 10.6)²/10.6 + (20 − 31.8)²/31.8
   = 2.870 + 0.153 + 0.185 + 1.091 + 0.638 + 4.379
   = 9.316
Example 2: M&M Colors (cont.)

χ² = 9.316; degrees of freedom = 6 − 1 = 5

The test statistic is χ² = 9.316; χ²_0.05 with 5 d.f. = 11.070.

Decision Rule: If χ² > χ²_.05, reject H0; otherwise, do not reject H0.

Here, χ² = 9.316 < χ²_.05 = 11.070, so we do not reject H0 and conclude that there is not sufficient evidence to conclude that Mars has changed the color proportions.

[Figure: do-not-reject region to the left of χ²_0.05 = 11.070; rejection region of area 0.05 in the right tail.]
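
The same test can be reproduced with a library routine; a sketch assuming scipy is available:

```python
# Sketch (assuming scipy): the M&M goodness-of-fit test via the library.
from scipy.stats import chisquare

observed = [29, 23, 12, 14, 8, 20]
expected = [21.2, 21.2, 10.6, 10.6, 10.6, 31.8]  # 106 x (.20, .20, .10, .10, .10, .30)

stat, p_value = chisquare(observed, f_exp=expected)
print(stat, p_value)  # statistic about 9.32; p-value about 0.097, above 0.05
```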
Example

Computer systems crash for many reasons, among them software failure, hardware failure, operator error, and system overloading. It is thought that 10% of the crashes are due to software failure, 5% to hardware failure, 25% to operator error, 40% to system overloading, and the rest to other causes. Over an extended period of study, 150 crashes are observed and each is classified according to its probable cause. It is found that 13 are due to software failure, 10 to hardware failure, 42 to operator error, 65 to system overloading, and the rest to other causes. Do these data lead us to suspect the accuracy of the stated percentages?
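
The slide poses this as an exercise; purely as an illustrative sketch (assuming scipy, and taking "other causes" to be 150 − 13 − 10 − 42 − 65 = 20 observed, with an expected share of 100% − 10% − 5% − 25% − 40% = 20%), the check could be run as:

```python
# Illustrative sketch only (the slide leaves the computation as an exercise).
from scipy.stats import chisquare

observed = [13, 10, 42, 65, 20]  # last entry: "other causes", by subtraction
expected = [0.10 * 150, 0.05 * 150, 0.25 * 150, 0.40 * 150, 0.20 * 150]

stat, p_value = chisquare(observed, f_exp=expected)
print(stat, p_value)  # statistic about 5.4, below the df = 4 critical value 9.49
```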
Contingency Tables

• A contingency table is a method of summarizing the relationship between variables. It is a table of frequencies classified according to the values of the variables in question, and it is used to summarize categorical data. What you find in the rows of a contingency table is contingent upon (dependent upon) what you find in the columns.

• Used to classify sample observations according to two or more characteristics.

• Also called a cross-classification table.
Contingency Table Example

Left-Handed vs. Gender
Dominant hand: left vs. right
Gender: male vs. female

▪ 2 categories for each variable, so this is called a 2 x 2 table

▪ Suppose we examine a sample of 300 children
Contingency Table Example (continued)

Sample results organized in a contingency table (sample size n = 300; of 120 females, 12 were left-handed; of 180 males, 24 were left-handed):

              Hand Preference
Gender     Left    Right   Total
Female     12      108     120
Male       24      156     180
Total      36      264     300
Testing for Independence

The null hypothesis is that the row and column variables are independent. The alternative hypothesis is that the row and column variables are dependent.

H0: The two categorical variables are independent (i.e., there is no relationship between them)
H1: The two categorical variables are dependent (i.e., there is a relationship between them)
Test for the Equality Between Proportions

H0: π1 = π2 (the proportion of females who are left-handed is equal to the proportion of males who are left-handed)
H1: π1 ≠ π2 (the two proportions are not the same; hand preference is not independent of gender)

• If H0 is true, then the proportion of left-handed females should be the same as the proportion of left-handed males. Left-handedness is then independent of gender.
Chi-Square Tests for Independence

χ² = Σ over all cells of (Obs − Exp)² / Exp

❑ Expected cell frequencies:

Exp = (row total × column total) / n

where:
row total = sum of all frequencies in the row
column total = sum of all frequencies in the column
n = overall sample size

df = (r − 1)(c − 1), where r is the number of rows and c the number of columns
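
A minimal numpy sketch (not from the slides) of the expected-frequency formula, using the hand-preference table introduced earlier:

```python
# Sketch: expected frequencies for a contingency table from the formula above.
import numpy as np

observed = np.array([[12, 108],
                     [24, 156]])                 # hand-preference table
row_totals = observed.sum(axis=1)                # [120, 180]
col_totals = observed.sum(axis=0)                # [ 36, 264]
n = observed.sum()                               # 300

expected = np.outer(row_totals, col_totals) / n  # row total x column total / n
print(expected)                                  # [[ 14.4 105.6] [ 21.6 158.4]]
```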
Computing the Average Proportion

The average proportion is: p̄ = (X1 + X2) / (n1 + n2) = X / n

Here, with 120 females of whom 12 were left-handed and 180 males of whom 24 were left-handed:

p̄ = (12 + 24) / (120 + 180) = 36 / 300 = 0.12

i.e., of all the children, the proportion of left-handers is 0.12, that is, 12%.
Finding Expected Frequencies

• To obtain the expected frequency for left-handed females, multiply the average proportion left-handed (p̄) by the total number of females.
• To obtain the expected frequency for left-handed males, multiply the average proportion left-handed (p̄) by the total number of males.

If the two proportions are equal, then
P(Left-Handed | Female) = P(Left-Handed | Male) = .12

i.e., we would expect (.12)(120) = 14.4 females and (.12)(180) = 21.6 males to be left-handed.
Observed vs. Expected Frequencies

              Hand Preference
Gender     Left               Right               Total
Female     Observed = 12      Observed = 108      120
           Expected = 14.4    Expected = 105.6
Male       Observed = 24      Observed = 156      180
           Expected = 21.6    Expected = 158.4
Total      36                 264                 300
The Chi-Square Test Statistic

Using the observed and expected frequencies above, the test statistic is:

χ²_STAT = Σ over all cells of (f_o − f_e)² / f_e
        = (12 − 14.4)²/14.4 + (108 − 105.6)²/105.6 + (24 − 21.6)²/21.6 + (156 − 158.4)²/158.4
        = 0.7576
Decision Rule

The test statistic is χ²_STAT = 0.7576; χ²_0.05 with 1 d.f. = 3.841.

Decision Rule: If χ²_STAT > 3.841, reject H0; otherwise, do not reject H0.

Here, χ²_STAT = 0.7576 < χ²_0.05 = 3.841, so we do not reject H0 and conclude that there is not sufficient evidence that the two proportions are different at α = 0.05.

[Figure: do-not-reject region to the left of χ²_0.05 = 3.841; rejection region of area 0.05 in the right tail.]
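
The same test can be run with a library routine; a sketch assuming scipy:

```python
# Sketch (assuming scipy): the hand-preference test via chi2_contingency.
# correction=False disables the Yates continuity correction so the statistic
# matches the hand computation above.
from scipy.stats import chi2_contingency

observed = [[12, 108],
            [24, 156]]
stat, p_value, df, expected = chi2_contingency(observed, correction=False)
print(stat, df, p_value)  # about 0.758 with df = 1; p about 0.38, above 0.05
```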
Example 2: Meal Plan Selection

• The meal plan selected by 200 students is shown below:

              Number of meals per week
Class
Standing   20/week   10/week   none   Total
Level 1    24        32        14     70
Level 2    22        26        12     60
Level 3    10        14        6      30
Level 4    14        16        10     40
Total      70        88        42     200
Example 2: Meal Plan Selection (cont.)

• The hypotheses to be tested are:

H0: Meal plan and class standing are independent (i.e., there is no relationship between them)
H1: Meal plan and class standing are dependent (i.e., there is a relationship between them)
Example 2: Meal Plan Selection (cont.)
Expected Cell Frequencies

Observed frequencies are as in the table on the previous slide. Expected cell frequencies if H0 is true:

              Number of meals per week
Class
Standing   20/wk   10/wk   none   Total
Level 1    24.5    30.8    14.7   70
Level 2    21.0    26.4    12.6   60
Level 3    10.5    13.2    6.3    30
Level 4    14.0    17.6    8.4    40
Total      70      88      42     200

Example for one cell:
Exp = (row total × column total) / n = (30 × 70) / 200 = 10.5
Example 2: Meal Plan Selection (cont.)
The Test Statistic

• The test statistic value is:

χ² = Σ over all cells of (Obs − Exp)² / Exp
   = (24 − 24.5)²/24.5 + (32 − 30.8)²/30.8 + … + (10 − 8.4)²/8.4
   = 0.709

• χ²_0.05 = 12.592 from the chi-squared distribution with (4 − 1)(3 − 1) = 6 degrees of freedom
Example 2: Meal Plan Selection (cont.)
Decision and Interpretation

The test statistic is χ² = 0.709; χ²_0.05 with 6 d.f. = 12.592.

Decision Rule: If χ² > 12.592, reject H0; otherwise, do not reject H0.

Here, χ² = 0.709 < χ²_0.05 = 12.592, so do not reject H0.

Conclusion: there is not sufficient evidence that meal plan and class standing are related.

[Figure: do-not-reject region to the left of χ²_0.05 = 12.592; rejection region of area 0.05 in the right tail.]
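
As a sketch (assuming scipy), the whole meal-plan test collapses to one library call; for tables larger than 2 x 2 no continuity correction is applied, so the result matches the hand computation:

```python
# Sketch (assuming scipy): the meal-plan independence test via the library.
from scipy.stats import chi2_contingency

observed = [[24, 32, 14],
            [22, 26, 12],
            [10, 14,  6],
            [14, 16, 10]]
stat, p_value, df, expected = chi2_contingency(observed)
print(stat, df, p_value)  # about 0.709 with df = 6; p close to 1
```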
χ² Test of Independence

The chi-square test statistic is:

χ²_STAT = Σ over all cells of (f_o − f_e)² / f_e

where:
f_o = observed frequency in a particular cell of the r x c table
f_e = expected frequency in a particular cell if H0 is true

χ²_STAT for the r x c case has (r − 1)(c − 1) degrees of freedom.

(Assumed: each cell in the contingency table has an expected frequency of at least 1.)
Expected Cell Frequencies

• Expected cell frequencies:

f_e = (row total × column total) / n

where:
row total = sum of all frequencies in the row
column total = sum of all frequencies in the column
n = overall sample size
Decision Rule

• The decision rule is: If χ²_STAT > χ²_α, reject H0; otherwise, do not reject H0,

where χ²_α is from the chi-squared distribution with (r − 1)(c − 1) degrees of freedom.
Example

Suppose you have the following categorical data for 3 types of flu in three different regions:

          Asia   Europe   America   Totals
Flu A     30     15       45        90
Flu B     2      5        53        60
Flu C     53     45       2         100
Totals    85     65       100       250

Is there a relationship between location and type of flu?
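
One way to check (a sketch assuming scipy, not a worked answer from the slides):

```python
# Sketch (assuming scipy): testing the flu/region table for independence.
from scipy.stats import chi2_contingency

observed = [[30, 15, 45],
            [ 2,  5, 53],
            [53, 45,  2]]
stat, p_value, df, expected = chi2_contingency(observed)
print(stat, df, p_value)  # df = (3-1)(3-1) = 4; the statistic is very large,
                          # far above the 0.05 critical value 9.49, so the data
                          # strongly suggest a relationship
```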
