CHP 15 - STAT 245 Summer 2021

Analysis of Categorical Data
Goodness of Fit
The multinomial experiment:
1. The experiment consists of n identical trials.
2. The outcome of each trial falls into one of k categories.
3. The probability that the outcome of a single trial falls into a particular category, say category i, is $p_i$ and remains constant from trial to trial. This probability must be between 0 and 1 for each of the k categories, and the sum of all k probabilities is
$$\sum_{i=1}^{k} p_i = 1. \tag{1}$$
4. The trials are independent.
5. The experimenter counts the observed number of outcomes in each category, written as $O_1, O_2, \ldots, O_k$, with
$$O_1 + O_2 + \cdots + O_k = n. \tag{2}$$
Recall (from Chapter 4 in PSY 233) that a binomial experiment consists of n identical trials with probability of success p on each trial. The probability of k successes in n trials is
$$P(x = k) = \frac{n!}{k!\,(n-k)!}\, p^{k} q^{n-k} \tag{3}$$
for all values of k = 0, 1, 2, ..., n, where q = 1 − p.

For a multinomial experiment, we have
$$P(x_1, x_2, \ldots, x_k) = \frac{n!}{x_1!\, x_2! \cdots x_k!}\, p_1^{x_1} p_2^{x_2} \cdots p_k^{x_k}. \tag{4}$$
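Equation (4) can be evaluated directly. The sketch below is only an illustration: the function name `multinomial_pmf` and the sample counts are assumptions, not part of the notes. With k = 2 categories it reduces to the binomial formula (3).

```python
from math import factorial, prod


def multinomial_pmf(counts, probs):
    """Return P(x_1, ..., x_k) from equation (4) for one set of cell counts."""
    if len(counts) != len(probs):
        raise ValueError("counts and probs must have the same length")
    n = sum(counts)
    # Multinomial coefficient n! / (x_1! x_2! ... x_k!), kept as an exact integer.
    coefficient = factorial(n)
    for x in counts:
        coefficient //= factorial(x)
    return coefficient * prod(p ** x for p, x in zip(probs, counts))


# Hypothetical illustration: n = 3 trials, two equally likely categories.
# With k = 2 this is the binomial case of equation (3): C(3, 2) * 0.5**3.
binomial_case = multinomial_pmf([2, 1], [0.5, 0.5])
print(binomial_case)  # 0.375
```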
Suppose we want to hypothesize values for each of the probabilities $p_1, p_2, \ldots, p_k$ as
$$H_0: P_1 = p_1,\; P_2 = p_2,\; \ldots,\; P_k = p_k$$
vs $H_1$: At least one of the probabilities is different.

Pearson’s Chi–Square Test $X^2$ is used:
$$X^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i} \tag{5}$$
summed over all k cells, with
$$E_i = n p_i. \tag{6}$$
Find $X^2_{k-1,\alpha}$ from the Chi–Square table.

Reject $H_0$ if $X^2_{\text{calc}} > X^2_{k-1,\alpha}$.
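The test procedure above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the helper name `chi_square_gof` and the sample counts are made up for the example, and the critical value is typed in from a chi-square table rather than computed.

```python
def chi_square_gof(observed, probs):
    """Return (statistic, expected counts, degrees of freedom) per equations (5) and (6)."""
    n = sum(observed)
    expected = [n * p for p in probs]  # E_i = n * p_i
    statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return statistic, expected, len(observed) - 1


# Hypothetical counts, not from the notes: 60 trials, three equally likely cells.
stat, expected, df = chi_square_gof([10, 20, 30], [1 / 3, 1 / 3, 1 / 3])
critical = 5.99  # X^2_{2, 0.05}, read from a chi-square table
print(round(stat, 3), df, stat > critical)
```

Returning the expected counts alongside the statistic makes it easy to compare them with a hand-built table.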
Question of Direct Relevance to Exam
Example: Do animals distinguish colors? One experiment is designed to investigate whether snakes can distinguish colors. Each snake is drawn to the end of a narrow passage, where the passage divides into three easy–to–open doors of different colours. Observations are taken for n = 90 snakes. Do snakes acquire a preference for one of the three different coloured doors?

|                        | Green | Red  | Blue |
|------------------------|-------|------|------|
| Observed Count ($O_i$) | 20    | 39   | 31   |
| Expected Count ($E_i$) | (30)  | (30) | (30) |

Solution
If the snakes have no preference for any door, then each door is expected to be chosen an equal number of times. That is,
$$H_0: P_1 = \tfrac{1}{3},\; P_2 = \tfrac{1}{3},\; P_3 = \tfrac{1}{3}$$
vs $H_1$: At least one $p_i$ is different from $\tfrac{1}{3}$.

To find the expected cell counts for each of the categories:
$$\begin{aligned}
E_1 &= n p_1 = (90)\tfrac{1}{3} = 30 \\
E_2 &= n p_2 = (90)\tfrac{1}{3} = 30 \\
E_3 &= n p_3 = (90)\tfrac{1}{3} = 30
\end{aligned}$$
$$\begin{aligned}
X^2_{\text{calc}} &= \sum_{i=1}^{3} \frac{(O_i - E_i)^2}{E_i} \\
&= \frac{(O_1 - E_1)^2}{E_1} + \frac{(O_2 - E_2)^2}{E_2} + \frac{(O_3 - E_3)^2}{E_3} \\
&= \frac{(20 - 30)^2}{30} + \frac{(39 - 30)^2}{30} + \frac{(31 - 30)^2}{30} \\
&= 6.067
\end{aligned}$$
Reject $H_0$ if $X^2_{\text{calc}} > X^2_{k-1,\alpha}$ (slide 4).

From the $X^2$ table: $X^2_{k-1,\alpha} = X^2_{3-1,0.05} = X^2_{2,0.05} = 5.99$.

Since $X^2_{\text{calc}} > X^2_{2,0.05}$, that is 6.067 > 5.99, $H_0$ is rejected.

The hypothesis of no preference is rejected. There is sufficient evidence to indicate that the snakes have a preference for one of the three doors.
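As a quick check of the arithmetic in the snake example, the short script below recomputes each cell's contribution to the statistic. Variable names are illustrative, and the critical value 5.99 is copied from the table lookup rather than computed.

```python
# Observed door choices for the n = 90 snakes: Green, Red, Blue.
observed = [20, 39, 31]
n = sum(observed)
expected = [n / 3] * 3  # equal preference under H0

# Contribution of each door to the statistic: (O_i - E_i)^2 / E_i.
contributions = [(o - e) ** 2 / e for o, e in zip(observed, expected)]
x2_calc = sum(contributions)

critical = 5.99  # X^2_{2, 0.05} from the chi-square table
reject_h0 = x2_calc > critical
print([round(c, 3) for c in contributions], round(x2_calc, 3), reject_h0)
```

The Green and Red doors account for nearly all of the statistic, which is where the observed counts differ most from 30.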
Example: Osteosarcoma is a highly malignant type of cancer in which the tumor occurs in a bone. Patients are often treated with one of the following four options: “Surgery”, “Chemotherapy”, “Radiation Therapy”, and “Amputation”. In a recent study, 1000 patients with Osteosarcoma were asked to select the treatment program they would prefer to follow (three months). The following data summarize all patient responses for the four types of treatment:

| Treatment Type         | Surgery | Chemotherapy | Radiation Therapy | Amputation |
|------------------------|---------|--------------|-------------------|------------|
| Observed Count ($O_i$) | 294     | 276          | 238               | 192        |
| Expected Count ($E_i$) | ( )     | ( )          | ( )               | ( )        |

Do the data present sufficient evidence to indicate that some treatments are preferred over others? Test using α = 0.05.
Solution
If patients have no preference in the choice of the treatment program, then each treatment type is expected to be chosen an equal number of times. That is,
$$H_0: P_1 = \tfrac{1}{4},\; P_2 = \tfrac{1}{4},\; P_3 = \tfrac{1}{4},\; P_4 = \tfrac{1}{4}$$
vs $H_1$: At least one $p_i$ is different from $\tfrac{1}{4}$.

$$\begin{aligned}
E_1 &= n p_1 = (1000)\tfrac{1}{4} = 250 \\
E_2 &= n p_2 = (1000)\tfrac{1}{4} = 250 \\
E_3 &= n p_3 = (1000)\tfrac{1}{4} = 250 \\
E_4 &= n p_4 = (1000)\tfrac{1}{4} = 250
\end{aligned}$$
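| Treatment Type         | Surgery | Chemotherapy | Radiation Therapy | Amputation |
|------------------------|---------|--------------|-------------------|------------|
| Observed Count ($O_i$) | 294     | 276          | 238               | 192        |
| Expected Count ($E_i$) | (250)   | (250)        | (250)             | (250)      |

$$\begin{aligned}
X^2_{\text{calc}} &= \sum_{i=1}^{4} \frac{(O_i - E_i)^2}{E_i} \\
&= \frac{(O_1 - E_1)^2}{E_1} + \frac{(O_2 - E_2)^2}{E_2} + \frac{(O_3 - E_3)^2}{E_3} + \frac{(O_4 - E_4)^2}{E_4} \\
&= \frac{(294 - 250)^2}{250} + \frac{(276 - 250)^2}{250} + \frac{(238 - 250)^2}{250} + \frac{(192 - 250)^2}{250} \\
&= 24.48
\end{aligned}$$
From the $X^2$ table: $X^2_{k-1,\alpha} = X^2_{4-1,0.05} = X^2_{3,0.05} = 7.82$.

Since $X^2_{\text{calc}} > X^2_{3,0.05}$, that is 24.48 > 7.82, $H_0$ is rejected.

The hypothesis of no preference is rejected.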
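The table lookup and the reject/do-not-reject decision can also be scripted. In the sketch below, the dictionary `CRITICAL_05` and the helper `gof_decision` are hypothetical names introduced for illustration. The critical values are the usual upper 5% points printed in chi-square tables, so 7.815 appears where the worked solution rounds to 7.82.

```python
# Upper 5% points of the chi-square distribution, as printed in standard tables.
CRITICAL_05 = {1: 3.841, 2: 5.991, 3: 7.815, 4: 9.488, 5: 11.070, 6: 12.592}


def gof_decision(observed, probs, table=CRITICAL_05):
    """Return (statistic, df, reject H0?) for a goodness-of-fit test at alpha = 0.05."""
    n = sum(observed)
    statistic = sum((o - n * p) ** 2 / (n * p) for o, p in zip(observed, probs))
    df = len(observed) - 1
    return statistic, df, statistic > table[df]


# Osteosarcoma data: Surgery, Chemotherapy, Radiation Therapy, Amputation.
stat, df, reject = gof_decision([294, 276, 238, 192], [0.25] * 4)
print(round(stat, 2), df, reject)
```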
Example: Previous records of Covid–19 in a large town indicate that, of the total number of patients who had Covid–19 during 2020, 60% had it during December 2020, 5% during August 2020, and the remainder during April 2020. In a recent study of 500 patients, 329 confirmed testing positive for Covid–19 during December 2020, 43 during August 2020, and the remainder during April 2020.

| Covid–19           | December 2020 | August 2020 | April 2020 |
|--------------------|---------------|-------------|------------|
| Observed ($O_i$)   | 329           | 43          | 128        |

Do these data indicate a departure from previous record rates? Test using α = 0.05.
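Solution
The hypothesis to be tested is:
$$H_0: P_1 = 0.6,\; P_2 = 0.05,\; P_3 = 0.35$$
vs $H_1$: At least one $p_i$ is different.

$$\begin{aligned}
E_1 &= n p_1 = (500)(0.6) = 300 \\
E_2 &= n p_2 = (500)(0.05) = 25 \\
E_3 &= n p_3 = (500)(0.35) = 175
\end{aligned}$$

| Covid–19           | December 2020 | August 2020 | April 2020 |
|--------------------|---------------|-------------|------------|
| Observed ($O_i$)   | 329           | 43          | 128        |
| Expected ($E_i$)   | (300)         | (25)        | (175)      |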
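$$\begin{aligned}
X^2_{\text{calc}} &= \sum_{i=1}^{3} \frac{(O_i - E_i)^2}{E_i} = \frac{(O_1 - E_1)^2}{E_1} + \frac{(O_2 - E_2)^2}{E_2} + \frac{(O_3 - E_3)^2}{E_3} \\
&= \frac{(329 - 300)^2}{300} + \frac{(43 - 25)^2}{25} + \frac{(128 - 175)^2}{175} \\
&= 28.39
\end{aligned}$$
From the $X^2$ table: $X^2_{k-1,\alpha} = X^2_{3-1,0.05} = X^2_{2,0.05} = 5.99$.

Since $X^2_{\text{calc}} > X^2_{2,0.05}$, that is 28.39 > 5.99, $H_0$ is rejected.

There is a departure from previous record rates.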
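When the hypothesized probabilities are unequal, the only change is in the expected counts. The following sketch recomputes the Covid–19 example. The variable names are illustrative, and the April entries are derived as "the remainder", as in the problem statement.

```python
# Hypothesized shares from previous records: December, August, April (the remainder).
p_december, p_august = 0.60, 0.05
probs = [p_december, p_august, 1 - p_december - p_august]

# Observed counts among the 500 patients; April is again the remainder.
n = 500
observed = [329, 43, n - 329 - 43]

expected = [n * p for p in probs]  # E_i = n * p_i
x2_calc = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

critical = 5.99  # X^2_{2, 0.05} from the chi-square table
print([round(e, 1) for e in expected], round(x2_calc, 2), x2_calc > critical)
```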
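Example: Given the frequency table data in the table below, compute the variance and standard deviation of the data using the grouped data formula. Is the given population normally distributed? Use α = 0.05.

Frequency table

| Class | Class Boundaries | $f_i$ | $x_{m_i}$ | $f_i \cdot x_{m_i}$ | $x_{m_i}^2$ | $f_i \cdot x_{m_i}^2$ |
|-------|------------------|-------|-----------|---------------------|-------------|-----------------------|
| 1     | 89.5–104.5       | 24    |           |                     |             |                       |
| 2     | 104.5–119.5      | 62    |           |                     |             |                       |
| 3     | 119.5–134.5      | 72    |           |                     |             |                       |
| 4     | 134.5–149.5      | 26    |           |                     |             |                       |
| 5     | 149.5–164.5      | 12    |           |                     |             |                       |
| 6     | 164.5–179.5      | 4     |           |                     |             |                       |
| Sums  |                  | $\sum_{i=1}^{n} f_i =$ | | $\sum_{i=1}^{n} f_i x_{m_i} =$ | | $\sum_{i=1}^{n} f_i x_{m_i}^2 =$ |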
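Solution
Frequency table

| Class | Class Boundaries | $f_i$ | $x_{m_i}$ | $f_i \cdot x_{m_i}$ | $x_{m_i}^2$ | $f_i \cdot x_{m_i}^2$ |
|-------|------------------|-------|-----------|---------------------|-------------|-----------------------|
| 1     | 89.5–104.5       | 24    | 97        | 2328                | 9409        | 225,816               |
| 2     | 104.5–119.5      | 62    | 112       | 6944                | 12,544      | 777,728               |
| 3     | 119.5–134.5      | 72    | 127       | 9144                | 16,129      | 1,161,288             |
| 4     | 134.5–149.5      | 26    | 142       | 3692                | 20,164      | 524,264               |
| 5     | 149.5–164.5      | 12    | 157       | 1884                | 24,649      | 295,788               |
| 6     | 164.5–179.5      | 4     | 172       | 688                 | 29,584      | 118,336               |
| Sums  |                  | $\sum_{i=1}^{n} f_i = 200$ | | $\sum_{i=1}^{n} f_i x_{m_i} = 24{,}680$ | | $\sum_{i=1}^{n} f_i x_{m_i}^2 = 3{,}103{,}220$ |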
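Now,
$$s^2 = \frac{\displaystyle\sum_{i=1}^{n} f_i \cdot x_{m_i}^2 - \frac{\left(\displaystyle\sum_{i=1}^{n} f_i \cdot x_{m_i}\right)^2}{n}}{n-1}
= \frac{3{,}103{,}220 - \dfrac{(24680)^2}{200}}{200 - 1} \approx 290$$
$$\Longrightarrow\; s = \sqrt{290} = 17.03$$
and
$$\bar{x} = \frac{\displaystyle\sum_{i=1}^{n} f_i x_{m_i}}{n} = \frac{24680}{200} = 123.4$$
$$\Longrightarrow\; \text{CVar} = \frac{s}{\bar{x}} \times 100\% = \frac{17.03}{123.4} \times 100\% = 13.8\%$$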
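The grouped-data computation above can be reproduced with a short script. The function name `grouped_summary` and its return keys are assumptions made for this illustration. It uses the class midpoints as $x_{m_i}$ and the same shortcut formula for $s^2$.

```python
from math import sqrt


def grouped_summary(boundaries, frequencies):
    """Mean, variance, standard deviation and CV (%) from a grouped frequency table."""
    midpoints = [(lo + hi) / 2 for lo, hi in boundaries]  # x_m for each class
    n = sum(frequencies)
    sum_fx = sum(f * x for f, x in zip(frequencies, midpoints))
    sum_fx2 = sum(f * x * x for f, x in zip(frequencies, midpoints))
    variance = (sum_fx2 - sum_fx ** 2 / n) / (n - 1)
    sd = sqrt(variance)
    mean = sum_fx / n
    return {"n": n, "sum_fx": sum_fx, "sum_fx2": sum_fx2,
            "mean": mean, "variance": variance, "sd": sd, "cv": 100 * sd / mean}


boundaries = [(89.5, 104.5), (104.5, 119.5), (119.5, 134.5),
              (134.5, 149.5), (149.5, 164.5), (164.5, 179.5)]
frequencies = [24, 62, 72, 26, 12, 4]
summary = grouped_summary(boundaries, frequencies)
print({key: round(value, 2) for key, value in summary.items()})
```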
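Solution
Frequency table

| Class | Class Boundaries | $f_i$ | z–transformed  | Area                 | $E_i$         | $O_i$ |
|-------|------------------|-------|----------------|----------------------|---------------|-------|
| 1     | 89.5–104.5       | 24    | −∞ to −1.11    | $p_1 = A_1 = 0.1335$ | $E_1 = 26.7$  | 24    |
| 2     | 104.5–119.5      | 62    | −1.11 to −0.23 | $p_2 = A_2 = 0.2755$ | $E_2 = 55.1$  | 62    |
| 3     | 119.5–134.5      | 72    | −0.23 to 0.65  | $p_3 = A_3 = 0.3332$ | $E_3 = 66.64$ | 72    |
| 4     | 134.5–149.5      | 26    | 0.65 to 1.53   | $p_4 = A_4 = 0.1948$ | $E_4 = 38.96$ | 26    |
| 5     | 149.5–164.5      | 12    | 1.53 to 2.41   | $p_5 = A_5 = 0.0550$ | $E_5 = 11.0$  | 12    |
| 6     | 164.5–179.5      | 4     | 2.41 to +∞     | $p_6 = A_6 = 0.0080$ | $E_6 = 1.6$   | 4     |
| Sums  |                  | $\sum_{i=1}^{n} f_i = 200$ | | $\sum_{i=1}^{n} p_i = 1$ | $\sum_{i=1}^{n} E_i = 200$ | $\sum_{i=1}^{n} O_i = 200$ |

where $E_i = n p_i = \left(\sum_{i=1}^{n} f_i\right) A_i$ and $O_i = f_i$.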
Solution
The null and alternative hypotheses are
H0 : The population is normally distributed.
vs H1 : The population is NOT normally distributed.

$$\begin{aligned}
p_1 &= P(Z \le -1.11) = 0.1335 \\
p_2 &= P(Z \le -0.23) - P(Z \le -1.11) = 0.2755 \\
p_3 &= P(Z \le 0.65) - P(Z \le -0.23) = 0.3332 \\
p_4 &= P(Z \le 1.53) - P(Z \le 0.65) = 0.1948 \\
p_5 &= P(Z \le 2.41) - P(Z \le 1.53) = 0.0550 \\
p_6 &= 1 - P(Z \le 2.41) = 0.0080
\end{aligned}$$

$$\begin{aligned}
E_1 &= n p_1 = (200)(0.1335) = 26.7 \\
E_2 &= n p_2 = (200)(0.2755) = 55.1 \\
E_3 &= n p_3 = (200)(0.3332) = 66.64 \\
E_4 &= n p_4 = (200)(0.1948) = 38.96 \\
E_5 &= n p_5 = (200)(0.0550) = 11.0 \\
E_6 &= n p_6 = (200)(0.0080) = 1.6
\end{aligned}$$
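$$\begin{aligned}
\Longrightarrow\; X^2_{\text{calc}} &= \sum_{i=1}^{6} \frac{(O_i - E_i)^2}{E_i} \\
&= \frac{(24 - 26.7)^2}{26.7} + \frac{(62 - 55.1)^2}{55.1} + \frac{(72 - 66.64)^2}{66.64} \\
&\quad + \frac{(26 - 38.96)^2}{38.96} + \frac{(12 - 11)^2}{11} + \frac{(4 - 1.6)^2}{1.6} \\
&= 9.57.
\end{aligned}$$
From the $X^2$ table: $X^2_{k-1,\alpha} = X^2_{6-1,0.05} = X^2_{5,0.05} = 11.07$.

Since $X^2_{\text{calc}} < X^2_{5,0.05}$, that is 9.57 < 11.07, $H_0$ is not rejected.

There is not sufficient evidence to conclude that the population departs from a normal distribution; the data are consistent with a normally distributed population.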
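The normality check can be sketched with Python's standard library, swapping the printed Z table for `statistics.NormalDist`. The helper name `normal_gof` is an assumption made for this illustration. Because the areas are not rounded to four decimals here, the statistic comes out close to, but not exactly, the hand value of 9.57.

```python
from statistics import NormalDist


def normal_gof(upper_boundaries, frequencies, mean, sd):
    """Chi-square fit to a normal curve; the first and last classes take the tails.

    upper_boundaries holds the upper boundary of every class except the last.
    """
    n = sum(frequencies)
    std_normal = NormalDist()
    # z-transform each interior boundary, rounded to 2 decimals like a Z table.
    z_cuts = [round((b - mean) / sd, 2) for b in upper_boundaries]
    cdf = [0.0] + [std_normal.cdf(z) for z in z_cuts] + [1.0]
    probs = [hi - lo for lo, hi in zip(cdf, cdf[1:])]  # class areas A_i
    expected = [n * p for p in probs]                  # E_i = n * p_i
    statistic = sum((o - e) ** 2 / e for o, e in zip(frequencies, expected))
    return z_cuts, probs, expected, statistic


z_cuts, probs, expected, x2_calc = normal_gof(
    [104.5, 119.5, 134.5, 149.5, 164.5], [24, 62, 72, 26, 12, 4], 123.4, 17.03)
critical = 11.07  # X^2_{5, 0.05} from the chi-square table
print(z_cuts, round(x2_calc, 2), x2_calc > critical)
```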
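Example: Given the frequency table data in the table below, compute the variance and standard deviation of the data using the grouped data formula. Is the given population normally distributed? Use α = 0.05.

Frequency table

| Class | Class Boundaries | $f_i$ | $x_{m_i}$ | $f_i \cdot x_{m_i}$ | $x_{m_i}^2$ | $f_i \cdot x_{m_i}^2$ |
|-------|------------------|-------|-----------|---------------------|-------------|-----------------------|
| 1     | 5.5–10.5         | 1     |           |                     |             |                       |
| 2     | 10.5–15.5        | 2     |           |                     |             |                       |
| 3     | 15.5–20.5        | 3     |           |                     |             |                       |
| 4     | 20.5–25.5        | 5     |           |                     |             |                       |
| 5     | 25.5–30.5        | 4     |           |                     |             |                       |
| 6     | 30.5–35.5        | 3     |           |                     |             |                       |
| 7     | 35.5–40.5        | 2     |           |                     |             |                       |
| Sums  |                  | $\sum_{i=1}^{n} f_i =$ | | $\sum_{i=1}^{n} f_i x_{m_i} =$ | | $\sum_{i=1}^{n} f_i x_{m_i}^2 =$ |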
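Solution
Frequency table

| Class | Class Boundaries | $f_i$ | $x_{m_i}$ | $f_i \cdot x_{m_i}$ | $x_{m_i}^2$ | $f_i \cdot x_{m_i}^2$ |
|-------|------------------|-------|-----------|---------------------|-------------|-----------------------|
| 1     | 5.5–10.5         | 1     | 8         | 8                   | 64          | 64                    |
| 2     | 10.5–15.5        | 2     | 13        | 26                  | 169         | 338                   |
| 3     | 15.5–20.5        | 3     | 18        | 54                  | 324         | 972                   |
| 4     | 20.5–25.5        | 5     | 23        | 115                 | 529         | 2645                  |
| 5     | 25.5–30.5        | 4     | 28        | 112                 | 784         | 3136                  |
| 6     | 30.5–35.5        | 3     | 33        | 99                  | 1089        | 3267                  |
| 7     | 35.5–40.5        | 2     | 38        | 76                  | 1444        | 2888                  |
| Sums  |                  | $\sum_{i=1}^{n} f_i = 20$ | | $\sum_{i=1}^{n} f_i x_{m_i} = 490$ | | $\sum_{i=1}^{n} f_i x_{m_i}^2 = 13310$ |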
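Now,
$$s^2 = \frac{\displaystyle\sum_{i=1}^{n} f_i \cdot x_{m_i}^2 - \frac{\left(\displaystyle\sum_{i=1}^{n} f_i \cdot x_{m_i}\right)^2}{n}}{n-1}
= \frac{13310 - \dfrac{(490)^2}{20}}{20 - 1} = \frac{13310 - 12005}{19} = 68.7$$
$$\Longrightarrow\; s = \sqrt{68.7} \approx 8.3$$
and
$$\bar{x} = \frac{\displaystyle\sum_{i=1}^{n} f_i x_{m_i}}{n} = \frac{490}{20} = 24.5$$
$$\Longrightarrow\; \text{CVar} = \frac{s}{\bar{x}} \times 100\% = \frac{8.3}{24.5} \times 100\% = 33.9\%$$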
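Solution
Frequency table

| Class | Class Boundaries | $f_i$ | z–transformed | Area | $E_i$ | $O_i$ |
|-------|------------------|-------|---------------|------|-------|-------|
| 1     | 5.5–10.5         | 1     |               |      |       |       |
| 2     | 10.5–15.5        | 2     |               |      |       |       |
| 3     | 15.5–20.5        | 3     |               |      |       |       |
| 4     | 20.5–25.5        | 5     |               |      |       |       |
| 5     | 25.5–30.5        | 4     |               |      |       |       |
| 6     | 30.5–35.5        | 3     |               |      |       |       |
| 7     | 35.5–40.5        | 2     |               |      |       |       |
| Sums  |                  | $\sum_{i=1}^{n} f_i = 20$ | | $\sum_{i=1}^{n} p_i = 1$ | $\sum_{i=1}^{n} E_i = 20$ | $\sum_{i=1}^{n} O_i = 20$ |

where $E_i = n p_i = \left(\sum_{i=1}^{n} f_i\right) p_i$ and $O_i = f_i$.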
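Solution
The null and alternative hypotheses are
H0 : The population is normally distributed.
vs H1 : The population is NOT normally distributed.

$$\begin{aligned}
p_1 &= A_1 = P(Z \le -1.68) = 0.0464 \\
p_2 &= A_2 = P(Z \le -1.08) - P(Z \le -1.68) = 0.0935 \\
p_3 &= A_3 = P(Z \le -0.48) - P(Z \le -1.08) = 0.1755 \\
p_4 &= A_4 = P(Z \le 0.12) - P(Z \le -0.48) = 0.2321 \\
p_5 &= A_5 = P(Z \le 0.72) - P(Z \le 0.12) = 0.2164 \\
p_6 &= A_6 = P(Z \le 1.32) - P(Z \le 0.72) = 0.1423 \\
p_7 &= A_7 = 1 - P(Z \le 1.32) = 0.0934
\end{aligned}$$

The expected counts below are reported to three decimals; the probabilities are shown rounded to four decimals, so the products agree only approximately.
$$\begin{aligned}
E_1 &= n p_1 = (20)(0.0464) \approx 0.929 \\
E_2 &= n p_2 = (20)(0.0935) \approx 1.871 \\
E_3 &= n p_3 = (20)(0.1755) \approx 3.510 \\
E_4 &= n p_4 = (20)(0.2321) \approx 4.642 \\
E_5 &= n p_5 = (20)(0.2164) \approx 4.329 \\
E_6 &= n p_6 = (20)(0.1423) \approx 2.846 \\
E_7 &= n p_7 = (20)(0.0934) \approx 1.868
\end{aligned}$$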
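Solution
Frequency table

| Class | Class Boundaries | $f_i$ | z–transformed  | Area           | $E_i$         | $O_i$ |
|-------|------------------|-------|----------------|----------------|---------------|-------|
| 1     | 5.5–10.5         | 1     | −∞ to −1.68    | $p_1 = 0.0464$ | $E_1 = 0.929$ | 1     |
| 2     | 10.5–15.5        | 2     | −1.68 to −1.08 | $p_2 = 0.0935$ | $E_2 = 1.871$ | 2     |
| 3     | 15.5–20.5        | 3     | −1.08 to −0.48 | $p_3 = 0.1755$ | $E_3 = 3.510$ | 3     |
| 4     | 20.5–25.5        | 5     | −0.48 to 0.12  | $p_4 = 0.2321$ | $E_4 = 4.642$ | 5     |
| 5     | 25.5–30.5        | 4     | 0.12 to 0.72   | $p_5 = 0.2164$ | $E_5 = 4.329$ | 4     |
| 6     | 30.5–35.5        | 3     | 0.72 to 1.32   | $p_6 = 0.1423$ | $E_6 = 2.846$ | 3     |
| 7     | 35.5–40.5        | 2     | 1.32 to +∞     | $p_7 = 0.0934$ | $E_7 = 1.868$ | 2     |
| Sums  |                  | $\sum_{i=1}^{n} f_i = 20$ | | $\sum_{i=1}^{n} p_i \approx 1$ | $\sum_{i=1}^{n} E_i \approx 20$ | $\sum_{i=1}^{n} O_i = 20$ |

where $E_i = n p_i = \left(\sum_{i=1}^{n} f_i\right) p_i$ and $O_i = f_i$.

$$\begin{aligned}
\Longrightarrow\; X^2_{\text{calc}} &= \sum_{i=1}^{7} \frac{(O_i - E_i)^2}{E_i} \\
&= \frac{(1 - 0.929)^2}{0.929} + \frac{(2 - 1.871)^2}{1.871} + \frac{(3 - 3.510)^2}{3.510} \\
&\quad + \frac{(5 - 4.642)^2}{4.642} + \frac{(4 - 4.329)^2}{4.329} + \frac{(3 - 2.846)^2}{2.846} + \frac{(2 - 1.868)^2}{1.868} \\
&= 0.1598
\end{aligned}$$
From the $X^2$ table: $X^2_{k-1,\alpha} = X^2_{7-1,0.05} = X^2_{6,0.05} = 12.592$.

Since $X^2_{\text{calc}} \ll X^2_{6,0.05}$, that is 0.1598 ≪ 12.592, $H_0$ is not rejected; the statistic is far below the critical value.

There is not sufficient evidence to conclude that the population departs from a normal distribution; the data are consistent with a normally distributed population.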
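Contingency Tables
Suppose we want to hypothesize values for each of the cell probabilities $p_{11}, p_{12}, \ldots, p_{rc}$ of a table with r rows and c columns as
$$H_0: P_{11} = p_{11},\; P_{12} = p_{12},\; \ldots,\; P_{rc} = p_{rc}$$
vs $H_1$: At least one of the probabilities is different.

Pearson’s Chi–Square Test $X^2$ is used:
$$X^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(O_{ij} - E_{ij})^2}{E_{ij}} \tag{7}$$
summed over all (r)(c) cells, with
$$E_{ij} = \frac{r_i c_j}{n}, \tag{8}$$
where $r_i$ is the total of row i, $c_j$ is the total of column j, and n is the overall sample size.

Find $X^2_{(r-1)(c-1),\alpha}$ from the Chi–Square table.

Reject $H_0$ if $X^2_{\text{calc}} > X^2_{(r-1)(c-1),\alpha}$.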
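Equations (7) and (8) translate into a few lines of code. The sketch below is illustrative only: the function name `chi_square_contingency` and the 2 × 2 counts are made up for the example and do not come from the notes.

```python
def chi_square_contingency(table):
    """Return (statistic, expected counts, df) for a table given as a list of rows."""
    row_totals = [sum(row) for row in table]        # r_i
    col_totals = [sum(col) for col in zip(*table)]  # c_j
    n = sum(row_totals)
    # E_ij = r_i * c_j / n, equation (8).
    expected = [[r * c / n for c in col_totals] for r in row_totals]
    statistic = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, expected, df


# Hypothetical 2 x 2 counts, chosen only to keep the arithmetic easy to follow.
stat, expected, df = chi_square_contingency([[10, 20], [30, 40]])
print(expected, round(stat, 4), df)
```

For a 2 × 2 table the statistic is compared with $X^2_{1,0.05} = 3.841$, so this made-up table would not lead to rejection.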
Example: A researcher wanted to study the relationship between gender and having Hernia. He took a sample of 2000 patients who had sharp abdominal pain from a big database. He obtained the information given in the following table:

|       | Hernia | No Hernia |
|-------|--------|-----------|
| Men   | 640    | 450       |
| Women | 440    | 470       |

Can you conclude that gender and having Hernia are related for all adults? Test using α = 0.05.
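Solution
H0 : Gender and Hernia are Independent
vs H1 : Gender and Hernia are not Independent

Expected counts are shown in parentheses.

|       | Hernia      | No Hernia   | Total |
|-------|-------------|-------------|-------|
| Men   | 640 (588.6) | 450 (501.4) | 1090  |
| Women | 440 (491.4) | 470 (418.6) | 910   |
| Total | 1080        | 920         | 2000  |

To find the expected cell counts for each cell:
$$\begin{aligned}
E_{11} &= \frac{r_1 c_1}{n} = \frac{(1090)(1080)}{2000} = 588.6 \\
E_{12} &= \frac{r_1 c_2}{n} = \frac{(1090)(920)}{2000} = 501.4 \\
E_{21} &= \frac{r_2 c_1}{n} = \frac{(910)(1080)}{2000} = 491.4 \\
E_{22} &= \frac{r_2 c_2}{n} = \frac{(910)(920)}{2000} = 418.6
\end{aligned}$$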
To compute the $X^2$ value:
$$\begin{aligned}
X^2_{\text{calc}} &= \sum_{i,j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}} \\
&= \frac{(O_{11} - E_{11})^2}{E_{11}} + \frac{(O_{12} - E_{12})^2}{E_{12}} + \frac{(O_{21} - E_{21})^2}{E_{21}} + \frac{(O_{22} - E_{22})^2}{E_{22}} \\
&= \frac{(640 - 588.6)^2}{588.6} + \frac{(450 - 501.4)^2}{501.4} + \frac{(440 - 491.4)^2}{491.4} + \frac{(470 - 418.6)^2}{418.6} \\
&= 4.489 + 5.269 + 5.376 + 6.311 = 21.445
\end{aligned}$$
Reject $H_0$ if $X^2_{\text{calc}} > X^2_{(r-1)(c-1),\alpha}$ (slide 30).

From the $X^2$ table: $X^2_{(r-1)(c-1),\alpha} = X^2_{(2-1)(2-1),0.05} = X^2_{1,0.05} = 3.841$.

Since $X^2_{\text{calc}} > X^2_{1,0.05}$, that is 21.445 > 3.841, $H_0$ is rejected.

Gender and Hernia are not independent. They are related.
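For a 2 × 2 table, the same statistic can be cross-checked with the algebraically equivalent shortcut $n(ad - bc)^2 / (r_1 r_2 c_1 c_2)$. The slides do not use this shortcut; it is introduced here only as an independent check on the cell-by-cell sum, and the function name is hypothetical.

```python
def chi_square_2x2_shortcut(a, b, c, d):
    """Pearson X^2 for the 2 x 2 table [[a, b], [c, d]] via n(ad - bc)^2 / (r1 r2 c1 c2)."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))


# Hernia data: Men (Hernia, No Hernia) then Women (Hernia, No Hernia).
x2_shortcut = chi_square_2x2_shortcut(640, 450, 440, 470)
critical = 3.841  # X^2_{1, 0.05} from the chi-square table
print(round(x2_shortcut, 3), x2_shortcut > critical)
```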
Example: A survey was conducted on a new Covid–19 vaccine. The vaccine was delivered in a three–shot sequence over a period of three weeks. Some people received all three shots, some stopped after the second shot, and others received only one shot. The following table gives the records of 1000 residents in a small community.

|             | One Shot | Two Shots | Three Shots |
|-------------|----------|-----------|-------------|
| Covid–19    | 24       | 9         | 13          |
| No Covid–19 | 289      | 100       | 565         |

Do the data present sufficient evidence that the vaccine was successful in reducing the number of Covid–19 cases in the small community? Test using α = 0.05.
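Solution
H0 : Vaccine and Covid–19 are Independent
vs H1 : Vaccine reduces Covid–19 Cases

Expected counts are shown in parentheses.

|             | One Shot    | Two Shots    | Three Shots  | Total |
|-------------|-------------|--------------|--------------|-------|
| Covid–19    | 24 (14.4)   | 9 (5.01)     | 13 (26.59)   | 46    |
| No Covid–19 | 289 (298.6) | 100 (103.99) | 565 (551.41) | 954   |
| Total       | 313         | 109          | 578          | 1000  |

To find the expected cell counts for each cell:
$$\begin{aligned}
E_{11} &= \frac{r_1 c_1}{n} = \frac{(46)(313)}{1000} = 14.4 \\
E_{12} &= \frac{r_1 c_2}{n} = \frac{(46)(109)}{1000} = 5.01 \\
E_{13} &= \frac{r_1 c_3}{n} = \frac{(46)(578)}{1000} = 26.59 \\
E_{21} &= \frac{r_2 c_1}{n} = \frac{(954)(313)}{1000} = 298.6 \\
E_{22} &= \frac{r_2 c_2}{n} = \frac{(954)(109)}{1000} = 103.99 \\
E_{23} &= \frac{r_2 c_3}{n} = \frac{(954)(578)}{1000} = 551.41
\end{aligned}$$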
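To compute the $X^2$ value:
$$\begin{aligned}
X^2_{\text{calc}} &= \sum_{i,j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}} \\
&= \frac{(O_{11} - E_{11})^2}{E_{11}} + \frac{(O_{12} - E_{12})^2}{E_{12}} + \frac{(O_{13} - E_{13})^2}{E_{13}} \\
&\quad + \frac{(O_{21} - E_{21})^2}{E_{21}} + \frac{(O_{22} - E_{22})^2}{E_{22}} + \frac{(O_{23} - E_{23})^2}{E_{23}} \\
&= \frac{(24 - 14.4)^2}{14.4} + \frac{(9 - 5.01)^2}{5.01} + \frac{(13 - 26.59)^2}{26.59} \\
&\quad + \frac{(289 - 298.6)^2}{298.6} + \frac{(100 - 103.99)^2}{103.99} + \frac{(565 - 551.41)^2}{551.41} \\
&= 17.31
\end{aligned}$$
Reject $H_0$ if $X^2_{\text{calc}} > X^2_{(r-1)(c-1),\alpha}$ (slide 30).

From the $X^2$ table: $X^2_{(r-1)(c-1),\alpha} = X^2_{(2-1)(3-1),0.05} = X^2_{2,0.05} = 5.99$.

Since $X^2_{\text{calc}} > X^2_{2,0.05}$, that is 17.31 > 5.99, $H_0$ is rejected.

There is sufficient evidence to indicate a relationship between treatment and incidence of Covid–19.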
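The vaccine table can be re-checked with a short script that builds the expected counts from the row and column totals and lists each cell's contribution. The variable names are illustrative, and the critical value is copied from the table lookup above.

```python
# Rows: Covid-19, No Covid-19. Columns: One Shot, Two Shots, Three Shots.
observed = [[24, 9, 13],
            [289, 100, 565]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# E_ij = r_i * c_j / n for every cell.
expected = [[r * c / n for c in col_totals] for r in row_totals]

# Cell-by-cell contributions (O_ij - E_ij)^2 / E_ij.
contributions = [
    [(o - e) ** 2 / e for o, e in zip(obs_row, exp_row)]
    for obs_row, exp_row in zip(observed, expected)
]
x2_calc = sum(sum(row) for row in contributions)
df = (len(row_totals) - 1) * (len(col_totals) - 1)

critical = 5.99  # X^2_{2, 0.05} from the chi-square table
print(row_totals, col_totals, round(x2_calc, 2), df, x2_calc > critical)
```

Most of the statistic comes from the Covid–19 row, where the one-shot and three-shot counts differ most from their expected values.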
