
Analysis of Categorical Data

Goodness of Fit
The multinomial experiment:
1. The experiment consists of n identical trials.
2. The outcome of each trial falls into one of k categories.
3. The probability that the outcome of a single trial falls into a particular category, say category i, is $p_i$ and remains constant from trial to trial. This probability must be between 0 and 1 for each of the k categories, and the k probabilities sum to one:

$$\sum_{i=1}^{k} p_i = 1. \tag{1}$$

4. The trials are independent.


5. The experimenter counts the observed number of outcomes in each category, written as $O_1, O_2, \ldots, O_k$, with

$$O_1 + O_2 + \cdots + O_k = n. \tag{2}$$

Recall (from Chapter 4 in PSY 233) that a binomial experiment consists of n identical trials with probability of success p on each trial. The probability of k successes in n trials is

$$P(x = k) = \frac{n!}{k!\,(n-k)!}\, p^{k} q^{n-k} \tag{3}$$

for k = 0, 1, 2, ..., n, where q = 1 - p.


For a multinomial experiment, we have

$$P(x_1, x_2, \ldots, x_k) = \frac{n!}{x_1!\,x_2!\cdots x_k!}\; p_1^{x_1} p_2^{x_2} \cdots p_k^{x_k} \tag{4}$$
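As a rough illustration of equation (4), the multinomial probability can be evaluated directly. The function name `multinomial_pmf` and the two check values are illustrative choices, not part of the slides.

```python
from math import factorial, prod

def multinomial_pmf(counts, probs):
    """P(x_1, ..., x_k) for category counts `counts` and probabilities `probs`."""
    n = sum(counts)                      # number of trials
    coef = factorial(n)
    for x in counts:
        coef //= factorial(x)            # n! / (x_1! x_2! ... x_k!), exact in integers
    return coef * prod(p ** x for p, x in zip(probs, counts))

# With k = 2 categories the formula reduces to the binomial case of equation (3):
# 3 trials, 2 successes, p = q = 0.5.
binom_check = multinomial_pmf([2, 1], [0.5, 0.5])

# Three equally likely categories, one outcome in each.
three_way = multinomial_pmf([1, 1, 1], [1/3, 1/3, 1/3])
```

The coefficient is built with integer division so that large factorials do not lose precision before the probabilities are multiplied in.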
Suppose we want to hypothesize values for each of the probabilities $p_1, p_2, \ldots, p_k$:

$$H_0 : P_1 = p_1,\ P_2 = p_2,\ \ldots,\ P_k = p_k \quad \text{vs} \quad H_1 : \text{at least one of the probabilities is different.}$$

Pearson's chi-square statistic $X^2$ is used:

$$X^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i} \tag{5}$$

summed over all k cells, with

$$E_i = n p_i. \tag{6}$$

Find the critical value $X^2_{k-1,\alpha}$ from the chi-square table.

Reject $H_0$ if $X^2_{calc} > X^2_{k-1,\alpha}$.
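The test procedure in equations (5) and (6) can be sketched in a few lines of Python. The names `chi_square_gof`, `reject_h0` and `CRITICAL_05` are hypothetical helpers introduced here; the critical values are the α = 0.05 table entries quoted in these slides, so no statistics library is assumed.

```python
# Critical values X^2_{df, 0.05} as quoted in the slides (chi-square table).
CRITICAL_05 = {1: 3.841, 2: 5.99, 3: 7.82, 5: 11.07, 6: 12.592}

def chi_square_gof(observed, probs):
    """Return (X2, expected counts, degrees of freedom) for H0: P_i = probs[i]."""
    if abs(sum(probs) - 1.0) > 1e-9:
        raise ValueError("hypothesized probabilities must sum to 1")
    n = sum(observed)
    expected = [n * p for p in probs]                      # E_i = n p_i, equation (6)
    x2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))  # equation (5)
    return x2, expected, len(observed) - 1                 # df = k - 1

def reject_h0(x2, critical_value):
    """Decision rule: reject H0 when the statistic exceeds the table value."""
    return x2 > critical_value
```

A call such as `chi_square_gof(counts, [1/3, 1/3, 1/3])` returns the statistic together with the expected counts, which makes it easy to compare against a hand calculation.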
Question of Direct Relevance to Exam

Example: Do animals distinguish colors? One experiment is designed to investigate whether snakes can distinguish colors by attracting them to the end of a narrow solid, where it divides into three different easy-to-open colored doors. Observations are taken for n = 90 snakes. Do snakes acquire a preference for one of the three colored doors?

                          Green    Red    Blue
Observed count ($O_i$)      20      39     31
Expected count ($E_i$)     (30)    (30)   (30)


Solution

If the snakes have no preference for any door, each door is expected to be chosen equally often. That is,

$$H_0 : P_1 = \tfrac{1}{3},\ P_2 = \tfrac{1}{3},\ P_3 = \tfrac{1}{3} \quad \text{vs} \quad H_1 : \text{at least one } P_i \text{ is different from } \tfrac{1}{3}.$$

The expected cell counts for the categories are

$$E_1 = np_1 = (90)\left(\tfrac{1}{3}\right) = 30, \qquad E_2 = np_2 = (90)\left(\tfrac{1}{3}\right) = 30, \qquad E_3 = np_3 = (90)\left(\tfrac{1}{3}\right) = 30.$$
$$\begin{aligned}
X^2_{calc} &= \sum_{i=1}^{3} \frac{(O_i - E_i)^2}{E_i} \\
&= \frac{(O_1 - E_1)^2}{E_1} + \frac{(O_2 - E_2)^2}{E_2} + \frac{(O_3 - E_3)^2}{E_3} \\
&= \frac{(20 - 30)^2}{30} + \frac{(39 - 30)^2}{30} + \frac{(31 - 30)^2}{30} \\
&= 6.067
\end{aligned}$$

Reject $H_0$ if $X^2_{calc} > X^2_{k-1,\alpha}$.

From the $X^2$ table: $X^2_{k-1,\alpha} = X^2_{3-1,0.05} = X^2_{2,0.05} = 5.99$.

Since $X^2_{calc} > X^2_{2,0.05}$, that is $6.067 > 5.99$, $H_0$ is rejected.

The hypothesis of no preference is rejected. There is sufficient evidence to indicate that the snakes have a preference for one of the three doors.
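The arithmetic for the snake example can be checked exactly with rational numbers. This is only a verification sketch; the variable names are illustrative and the critical value 5.99 is the table entry quoted in the text.

```python
from fractions import Fraction

observed = [20, 39, 31]                    # green, red, blue
n = sum(observed)                          # 90 snakes
expected = [Fraction(n, 3)] * 3            # E_i = n * (1/3) = 30 under H0

# Pearson's statistic, kept as an exact fraction (182/30 in lowest terms)
x2_calc = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
critical = 5.99                            # X^2_{2, 0.05} from the chi-square table

print(float(x2_calc))                      # decimal value of the statistic
print("reject H0:", x2_calc > critical)
```

Working with `Fraction` avoids rounding in the intermediate terms, so the only rounding happens when the final value is reported.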
Question of Direct Relevance to Exam

Example: Osteosarcoma is a highly malignant type of cancer in which the tumor occurs in a bone. Patients are often treated with one of the following four options: “Surgery”, “Chemotherapy”, “Radiation Therapy”, and “Amputation”. In a recent study, 1000 patients with osteosarcoma were asked to select the treatment program they would prefer to follow (three months). The following data summarize the patient responses for the four treatment types:

Treatment type            Surgery   Chemotherapy   Radiation Therapy   Amputation
Observed count ($O_i$)      294         276               238              192

Do the data present sufficient evidence to indicate that some treatments are preferred over others? Test using α = 0.05.
Solution
If the patients have no preference in the choice of treatment program, each treatment type is expected to be chosen equally often. That is,

$$H_0 : P_1 = \tfrac{1}{4},\ P_2 = \tfrac{1}{4},\ P_3 = \tfrac{1}{4},\ P_4 = \tfrac{1}{4} \quad \text{vs} \quad H_1 : \text{at least one } P_i \text{ is different from } \tfrac{1}{4}.$$

The expected cell counts for the categories are

$$E_1 = np_1 = (1000)\left(\tfrac{1}{4}\right) = 250, \qquad E_2 = np_2 = (1000)\left(\tfrac{1}{4}\right) = 250,$$

$$E_3 = np_3 = (1000)\left(\tfrac{1}{4}\right) = 250, \qquad E_4 = np_4 = (1000)\left(\tfrac{1}{4}\right) = 250.$$
The observed counts, with the expected counts in parentheses, are:

Treatment type            Surgery   Chemotherapy   Radiation Therapy   Amputation
Observed count ($O_i$)      294         276               238              192
Expected count ($E_i$)     (250)       (250)             (250)            (250)
$$\begin{aligned}
X^2_{calc} &= \sum_{i=1}^{4} \frac{(O_i - E_i)^2}{E_i} \\
&= \frac{(O_1 - E_1)^2}{E_1} + \frac{(O_2 - E_2)^2}{E_2} + \frac{(O_3 - E_3)^2}{E_3} + \frac{(O_4 - E_4)^2}{E_4} \\
&= \frac{(294 - 250)^2}{250} + \frac{(276 - 250)^2}{250} + \frac{(238 - 250)^2}{250} + \frac{(192 - 250)^2}{250} \\
&= 24.48
\end{aligned}$$

Reject $H_0$ if $X^2_{calc} > X^2_{k-1,\alpha}$.

From the $X^2$ table: $X^2_{k-1,\alpha} = X^2_{4-1,0.05} = X^2_{3,0.05} = 7.82$.

Since $X^2_{calc} > X^2_{3,0.05}$, that is $24.48 > 7.82$, $H_0$ is rejected.

The hypothesis of no preference is rejected.
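To see where the statistic of 24.48 comes from, the per-category contributions can be listed separately. This is a small sketch with illustrative names; it uses only the counts given in the example.

```python
treatments = ["Surgery", "Chemotherapy", "Radiation Therapy", "Amputation"]
observed = [294, 276, 238, 192]
n = sum(observed)                          # 1000 patients
expected = n / len(observed)               # 250 per treatment under H0

# Contribution of each cell to X^2: (O_i - E_i)^2 / E_i
contributions = {t: (o - expected) ** 2 / expected
                 for t, o in zip(treatments, observed)}
x2_calc = sum(contributions.values())

# The category that deviates most from the equal-preference hypothesis
largest = max(contributions, key=contributions.get)

for t, c in contributions.items():
    print(f"{t:18s} {c:7.3f}")
print("X2 =", round(x2_calc, 2), "| largest contribution:", largest)
```

Breaking the sum into cells shows which categories drive the rejection, something the single number 24.48 does not reveal.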
Question of Direct Relevance to Exam

Example: Previous records of Covid–19 in a large town indicate that, of all patients who had Covid–19 during 2020, 60% had it during December 2020, 5% during August 2020, and the remainder during April 2020. In a recent study of 500 patients, 329 confirmed testing positive for Covid–19 during December 2020, 43 during August 2020, and the remainder during April 2020.

Covid–19                  December 2020   August 2020   April 2020
Observed count ($O_i$)         329             43           128

Do these data indicate a departure from the previous record rates? Test using α = 0.05.
Solution
The hypothesis to be tested is

$$H_0 : P_1 = 0.6,\ P_2 = 0.05,\ P_3 = 0.35 \quad \text{vs} \quad H_1 : \text{at least one } P_i \text{ is different.}$$

The expected cell counts for the categories are

$$E_1 = np_1 = (500)(0.6) = 300, \qquad E_2 = np_2 = (500)(0.05) = 25, \qquad E_3 = np_3 = (500)(0.35) = 175.$$

Covid–19                  December 2020   August 2020   April 2020
Observed count ($O_i$)         329             43           128
Expected count ($E_i$)        (300)           (25)         (175)


$$\begin{aligned}
X^2_{calc} &= \sum_{i=1}^{3} \frac{(O_i - E_i)^2}{E_i} = \frac{(O_1 - E_1)^2}{E_1} + \frac{(O_2 - E_2)^2}{E_2} + \frac{(O_3 - E_3)^2}{E_3} \\
&= \frac{(329 - 300)^2}{300} + \frac{(43 - 25)^2}{25} + \frac{(128 - 175)^2}{175} \\
&= 28.39
\end{aligned}$$

Reject $H_0$ if $X^2_{calc} > X^2_{k-1,\alpha}$.

From the $X^2$ table: $X^2_{k-1,\alpha} = X^2_{3-1,0.05} = X^2_{2,0.05} = 5.99$.

Since $X^2_{calc} > X^2_{2,0.05}$, that is $28.39 > 5.99$, $H_0$ is rejected.

There is a departure from the previous record rates.
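When the hypothesized probabilities are unequal, the expected counts differ from cell to cell. A minimal sketch for the Covid–19 example follows; the dictionary layout and names are illustrative only.

```python
observed = {"December 2020": 329, "August 2020": 43, "April 2020": 128}

p_dec, p_aug = 0.60, 0.05
hypothesized = {
    "December 2020": p_dec,
    "August 2020": p_aug,
    "April 2020": 1 - p_dec - p_aug,       # "the remainder", i.e. 0.35
}

n = sum(observed.values())                 # 500 patients
expected = {m: n * p for m, p in hypothesized.items()}   # E_i = n p_i

x2_calc = sum((observed[m] - expected[m]) ** 2 / expected[m] for m in observed)
critical = 5.99                            # X^2_{2, 0.05} from the text

print(round(x2_calc, 2), "reject H0:", x2_calc > critical)
```

Deriving the April probability as a remainder mirrors the wording of the problem and guarantees that the three probabilities sum to one.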
Question of Direct Relevance to Exam

Example: Given the frequency table below, compute the variance and standard deviation of the data using the grouped data formula. Is the given population normally distributed? Use α = 0.05.

Frequency table

Class   Class boundaries   $f_i$   $x_{m_i}$   $f_i \cdot x_{m_i}$   $x_{m_i}^2$   $f_i \cdot x_{m_i}^2$
1       89.5–104.5          24
2       104.5–119.5         62
3       119.5–134.5         72
4       134.5–149.5         26
5       149.5–164.5         12
6       164.5–179.5          4
Sums                        $\sum_i f_i =$      $\sum_i f_i x_{m_i} =$      $\sum_i f_i x_{m_i}^2 =$
Solution
Frequency table

Class   Class boundaries   $f_i$   $x_{m_i}$   $f_i \cdot x_{m_i}$   $x_{m_i}^2$   $f_i \cdot x_{m_i}^2$
1       89.5–104.5          24       97          2328                 9409          225,816
2       104.5–119.5         62      112          6944               12,544          777,728
3       119.5–134.5         72      127          9144               16,129        1,161,288
4       134.5–149.5         26      142          3692               20,164          524,264
5       149.5–164.5         12      157          1884               24,649          295,788
6       164.5–179.5          4      172           688               29,584          118,336
Sums    $\sum_i f_i = 200$, $\sum_i f_i x_{m_i} = 24{,}680$, $\sum_i f_i x_{m_i}^2 = 3{,}103{,}220$
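Now, with the sums taken over the classes,

$$s^2 = \frac{\sum_i f_i x_{m_i}^2 - \dfrac{\left(\sum_i f_i x_{m_i}\right)^2}{n}}{n - 1} = \frac{3{,}103{,}220 - \dfrac{(24{,}680)^2}{200}}{200 - 1} \approx 290,$$

so that $s = \sqrt{290} = 17.03$, and

$$\bar{x} = \frac{\sum_i f_i x_{m_i}}{n} = \frac{24{,}680}{200} = 123.4,$$

which gives

$$CVar = \frac{s}{\bar{x}} \times 100\% = \frac{17.03}{123.4} \times 100\% = 13.8\%.$$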
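The grouped-data calculation can be reproduced from the class boundaries and frequencies alone. The sketch below assumes, as the table does, that each class is represented by its midpoint; the variable names are illustrative.

```python
from math import sqrt

boundaries = [(89.5, 104.5), (104.5, 119.5), (119.5, 134.5),
              (134.5, 149.5), (149.5, 164.5), (164.5, 179.5)]
freqs = [24, 62, 72, 26, 12, 4]

midpoints = [(lo + hi) / 2 for lo, hi in boundaries]       # x_m for each class
n = sum(freqs)                                             # total frequency
sum_fx = sum(f * x for f, x in zip(freqs, midpoints))      # sum of f * x_m
sum_fx2 = sum(f * x * x for f, x in zip(freqs, midpoints)) # sum of f * x_m^2

variance = (sum_fx2 - sum_fx ** 2 / n) / (n - 1)           # grouped-data formula
std_dev = sqrt(variance)
mean = sum_fx / n
cv_percent = std_dev / mean * 100                          # coefficient of variation

print(n, sum_fx, sum_fx2)
print(round(variance, 2), round(std_dev, 2), round(mean, 1), round(cv_percent, 1))
```

Changing `boundaries` and `freqs` is enough to rerun the same calculation on another frequency table.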
Frequency table

Class   Class boundaries   $f_i$   z-transformed            Area                    $E_i$             $O_i$
1       89.5–104.5          24     $-\infty$ to $-1.11$     $p_1 = A_1 = 0.1335$    $E_1 = 26.7$       24
2       104.5–119.5         62     $-1.11$ to $-0.23$       $p_2 = A_2 = 0.2755$    $E_2 = 55.1$       62
3       119.5–134.5         72     $-0.23$ to $0.65$        $p_3 = A_3 = 0.3332$    $E_3 = 66.64$      72
4       134.5–149.5         26     $0.65$ to $1.53$         $p_4 = A_4 = 0.1948$    $E_4 = 38.96$      26
5       149.5–164.5         12     $1.53$ to $2.41$         $p_5 = A_5 = 0.0550$    $E_5 = 11.0$       12
6       164.5–179.5          4     $2.41$ to $+\infty$      $p_6 = A_6 = 0.0080$    $E_6 = 1.6$         4
Sums    $\sum_i f_i = 200$, $\sum_i p_i = 1$, $\sum_i E_i = 200$, $\sum_i O_i = 200$

where $E_i = n p_i = \left(\sum_i f_i\right) A_i$ and $O_i = f_i$.
The null and alternative hypotheses are

$H_0$: The population is normally distributed.
vs $H_1$: The population is NOT normally distributed.

Each class boundary b is standardized as $z = (b - \bar{x})/s$ with $\bar{x} = 123.4$ and $s = 17.03$. Under $H_0$, the cell probabilities are the standard normal areas between consecutive z-values:

$$\begin{aligned}
p_1 &= P(Z \le -1.11) = 0.1335 \\
p_2 &= P(Z \le -0.23) - P(Z \le -1.11) = 0.2755 \\
p_3 &= P(Z \le 0.65) - P(Z \le -0.23) = 0.3332 \\
p_4 &= P(Z \le 1.53) - P(Z \le 0.65) = 0.1948 \\
p_5 &= P(Z \le 2.41) - P(Z \le 1.53) = 0.0550 \\
p_6 &= 1 - P(Z \le 2.41) = 0.0080
\end{aligned}$$
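The areas above come from a standard normal table. As a cross-check, the same probabilities can be approximated with the error function from Python's standard library; `phi` is a helper name introduced here, and the z-values are rounded to two decimals to mimic a table lookup.

```python
from math import erf, sqrt

def phi(z):
    """Standard normal cumulative probability P(Z <= z)."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

mean, s = 123.4, 17.03                                # grouped-data estimates from the text
inner_boundaries = [104.5, 119.5, 134.5, 149.5, 164.5]

# z-transform of each interior class boundary, rounded as in a z-table
z = [round((b - mean) / s, 2) for b in inner_boundaries]

# Cumulative probabilities, with the two open tails at 0 and 1
cdf = [0.0] + [phi(v) for v in z] + [1.0]
probs = [hi - lo for lo, hi in zip(cdf, cdf[1:])]     # area of each class

for zi in z:
    print(zi)
for p in probs:
    print(round(p, 4))
```

The first and last classes are treated as open-ended tails, which is why the probabilities sum to one.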
The expected cell counts for the categories are

$$\begin{aligned}
E_1 &= np_1 = (200)(0.1335) = 26.7 \\
E_2 &= np_2 = (200)(0.2755) = 55.1 \\
E_3 &= np_3 = (200)(0.3332) = 66.64 \\
E_4 &= np_4 = (200)(0.1948) = 38.96 \\
E_5 &= np_5 = (200)(0.0550) = 11.0 \\
E_6 &= np_6 = (200)(0.0080) = 1.6
\end{aligned}$$
$$\begin{aligned}
X^2_{calc} &= \sum_{i=1}^{6} \frac{(O_i - E_i)^2}{E_i} \\
&= \frac{(24 - 26.7)^2}{26.7} + \frac{(62 - 55.1)^2}{55.1} + \frac{(72 - 66.64)^2}{66.64} + \frac{(26 - 38.96)^2}{38.96} + \frac{(12 - 11)^2}{11} + \frac{(4 - 1.6)^2}{1.6} \\
&= 9.57
\end{aligned}$$

Reject $H_0$ if $X^2_{calc} > X^2_{k-1,\alpha}$.

From the $X^2$ table: $X^2_{k-1,\alpha} = X^2_{6-1,0.05} = X^2_{5,0.05} = 11.07$.

Since $X^2_{calc} < X^2_{5,0.05}$, that is $9.57 < 11.07$, $H_0$ is not rejected.

The data are consistent with a normally distributed population.
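The statistic for the normality test can be recomputed from the observed frequencies and the tabulated areas. The sketch keeps the k - 1 = 5 degrees of freedom used in the text; some textbooks subtract one further degree of freedom for each parameter estimated from the data, which would change the critical value. Names are illustrative.

```python
observed = [24, 62, 72, 26, 12, 4]
probs = [0.1335, 0.2755, 0.3332, 0.1948, 0.0550, 0.0080]   # areas from the table

n = sum(observed)                               # 200
expected = [n * p for p in probs]               # E_i = n p_i

x2_calc = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

df = len(observed) - 1                          # k - 1 = 5, as in the text
critical = 11.07                                # X^2_{5, 0.05} from the text
reject = x2_calc > critical

print(round(x2_calc, 2), "reject H0:", reject)
```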
Question of Direct Relevance to Exam

Example: Given the frequency table below, compute the variance and standard deviation of the data using the grouped data formula. Is the given population normally distributed? Use α = 0.05.

Frequency table

Class   Class boundaries   $f_i$   $x_{m_i}$   $f_i \cdot x_{m_i}$   $x_{m_i}^2$   $f_i \cdot x_{m_i}^2$
1       5.5–10.5            1
2       10.5–15.5           2
3       15.5–20.5           3
4       20.5–25.5           5
5       25.5–30.5           4
6       30.5–35.5           3
7       35.5–40.5           2
Sums                        $\sum_i f_i =$      $\sum_i f_i x_{m_i} =$      $\sum_i f_i x_{m_i}^2 =$
Solution
Frequency table

Class   Class boundaries   $f_i$   $x_{m_i}$   $f_i \cdot x_{m_i}$   $x_{m_i}^2$   $f_i \cdot x_{m_i}^2$
1       5.5–10.5            1        8            8                    64             64
2       10.5–15.5           2       13           26                   169            338
3       15.5–20.5           3       18           54                   324            972
4       20.5–25.5           5       23          115                   529           2645
5       25.5–30.5           4       28          112                   784           3136
6       30.5–35.5           3       33           99                  1089           3267
7       35.5–40.5           2       38           76                  1444           2888
Sums    $\sum_i f_i = 20$, $\sum_i f_i x_{m_i} = 490$, $\sum_i f_i x_{m_i}^2 = 13{,}310$
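Now, with the sums taken over the classes,

$$s^2 = \frac{\sum_i f_i x_{m_i}^2 - \dfrac{\left(\sum_i f_i x_{m_i}\right)^2}{n}}{n - 1} = \frac{13{,}310 - \dfrac{(490)^2}{20}}{20 - 1} = \frac{13{,}310 - 12{,}005}{19} = 68.7,$$

so that $s = \sqrt{68.7} \approx 8.3$, and

$$\bar{x} = \frac{\sum_i f_i x_{m_i}}{n} = \frac{490}{20} = 24.5,$$

which gives

$$CVar = \frac{s}{\bar{x}} \times 100\% = \frac{8.3}{24.5} \times 100\% = 33.9\%.$$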
The null and alternative hypotheses are

$H_0$: The population is normally distributed.
vs $H_1$: The population is NOT normally distributed.

Under $H_0$, the cell probabilities are the standard normal areas between the z-transformed class boundaries:

$$\begin{aligned}
p_1 &= A_1 = P(Z \le -1.68) = 0.0464 \\
p_2 &= A_2 = P(Z \le -1.08) - P(Z \le -1.68) = 0.0935 \\
p_3 &= A_3 = P(Z \le -0.48) - P(Z \le -1.08) = 0.1755 \\
p_4 &= A_4 = P(Z \le 0.12) - P(Z \le -0.48) = 0.2321 \\
p_5 &= A_5 = P(Z \le 0.72) - P(Z \le 0.12) = 0.2164 \\
p_6 &= A_6 = P(Z \le 1.32) - P(Z \le 0.72) = 0.1423 \\
p_7 &= A_7 = 1 - P(Z \le 1.32) = 0.0934
\end{aligned}$$
The expected cell counts for the categories, computed from the unrounded probabilities, are

$$\begin{aligned}
E_1 &= np_1 = (20)(0.0464) \approx 0.929 \\
E_2 &= np_2 = (20)(0.0935) \approx 1.871 \\
E_3 &= np_3 = (20)(0.1755) \approx 3.510 \\
E_4 &= np_4 = (20)(0.2321) \approx 4.642 \\
E_5 &= np_5 = (20)(0.2164) \approx 4.329 \\
E_6 &= np_6 = (20)(0.1423) \approx 2.846 \\
E_7 &= np_7 = (20)(0.0934) \approx 1.868
\end{aligned}$$
Frequency table

Class   Class boundaries   $f_i$   z-transformed            Area              $E_i$            $O_i$
1       5.5–10.5            1      $-\infty$ to $-1.68$     $p_1 = 0.0464$    $E_1 = 0.929$     1
2       10.5–15.5           2      $-1.68$ to $-1.08$       $p_2 = 0.0935$    $E_2 = 1.871$     2
3       15.5–20.5           3      $-1.08$ to $-0.48$       $p_3 = 0.1755$    $E_3 = 3.510$     3
4       20.5–25.5           5      $-0.48$ to $0.12$        $p_4 = 0.2321$    $E_4 = 4.642$     5
5       25.5–30.5           4      $0.12$ to $0.72$         $p_5 = 0.2164$    $E_5 = 4.329$     4
6       30.5–35.5           3      $0.72$ to $1.32$         $p_6 = 0.1423$    $E_6 = 2.846$     3
7       35.5–40.5           2      $1.32$ to $+\infty$      $p_7 = 0.0934$    $E_7 = 1.868$     2
Sums    $\sum_i f_i = 20$, $\sum_i p_i \approx 1$, $\sum_i E_i \approx 20$, $\sum_i O_i = 20$

where $E_i = n p_i = \left(\sum_i f_i\right) p_i$ and $O_i = f_i$.

$$\begin{aligned}
X^2_{calc} &= \sum_{i=1}^{7} \frac{(O_i - E_i)^2}{E_i} \\
&= \frac{(1 - 0.929)^2}{0.929} + \frac{(2 - 1.871)^2}{1.871} + \frac{(3 - 3.510)^2}{3.510} + \frac{(5 - 4.642)^2}{4.642} \\
&\quad + \frac{(4 - 4.329)^2}{4.329} + \frac{(3 - 2.846)^2}{2.846} + \frac{(2 - 1.868)^2}{1.868} \\
&\approx 0.159
\end{aligned}$$
Reject $H_0$ if $X^2_{calc} > X^2_{k-1,\alpha}$.

From the $X^2$ table: $X^2_{k-1,\alpha} = X^2_{7-1,0.05} = X^2_{6,0.05} = 12.592$.

Since $X^2_{calc} \ll X^2_{6,0.05}$, that is $0.159 \ll 12.592$, $H_0$ is not rejected.

The data are consistent with a normally distributed population.
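The whole second normality check can be traced from the z-boundaries quoted in the text to the final statistic. This is a sketch under the assumption that the standard normal areas are evaluated without rounding (here through `math.erf`), so the expected counts agree with the table only to about three decimals; `phi` is a helper name introduced for illustration.

```python
from math import erf, sqrt

def phi(z):
    """Standard normal cumulative probability P(Z <= z)."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

observed = [1, 2, 3, 5, 4, 3, 2]
z_bounds = [-1.68, -1.08, -0.48, 0.12, 0.72, 1.32]   # z-values quoted in the text

cdf = [0.0] + [phi(z) for z in z_bounds] + [1.0]
probs = [hi - lo for lo, hi in zip(cdf, cdf[1:])]    # p_1 ... p_7

n = sum(observed)                                    # 20
expected = [n * p for p in probs]                    # E_i = n p_i

x2_calc = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
critical = 12.592                                    # X^2_{6, 0.05} from the text

print([round(e, 3) for e in expected])
print(round(x2_calc, 3), "reject H0:", x2_calc > critical)
```

Because the observed and expected counts nearly coincide in every class, the statistic is far below the critical value.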
Contingency Tables
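Suppose we want to hypothesize values for each of the cell probabilities $p_{11}, p_{12}, \ldots, p_{rc}$ of a table with r rows and c columns:

$$H_0 : P_{11} = p_{11},\ P_{12} = p_{12},\ \ldots,\ P_{rc} = p_{rc} \quad \text{vs} \quad H_1 : \text{at least one of the probabilities is different.}$$

Pearson's chi-square statistic $X^2$ is used:

$$X^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(O_{ij} - E_{ij})^2}{E_{ij}} \tag{7}$$

summed over all rc cells, with

$$E_{ij} = \frac{r_i c_j}{n}, \tag{8}$$

where $r_i$ is the total of row i, $c_j$ is the total of column j, and n is the overall sample size.

Find the critical value $X^2_{(r-1)(c-1),\alpha}$ from the chi-square table.

Reject $H_0$ if $X^2_{calc} > X^2_{(r-1)(c-1),\alpha}$.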
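Equations (7) and (8) translate into a short routine that works for a table of any size. The function name `chi_square_independence` is a hypothetical helper, and the table is assumed to be a list of rows of observed counts.

```python
def chi_square_independence(table):
    """Return (X2, expected counts, degrees of freedom) for a table of observed counts."""
    row_totals = [sum(row) for row in table]                  # r_i
    col_totals = [sum(col) for col in zip(*table)]            # c_j
    n = sum(row_totals)                                       # overall sample size

    # E_ij = r_i * c_j / n, equation (8)
    expected = [[r * c / n for c in col_totals] for r in row_totals]

    # X^2 summed over all cells, equation (7)
    x2 = sum((o - e) ** 2 / e
             for obs_row, exp_row in zip(table, expected)
             for o, e in zip(obs_row, exp_row))

    df = (len(row_totals) - 1) * (len(col_totals) - 1)        # (r - 1)(c - 1)
    return x2, expected, df
```

Returning the expected counts alongside the statistic makes it straightforward to compare them with the values written in parentheses in a worked table.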
Question of Direct Relevance to Exam

Example: A researcher wanted to study the relationship between gender and having a hernia. He took a sample of 2000 patients who had sharp abdominal pain from a large database and obtained the information given in the following table:

          Hernia   No Hernia
Men         640       450
Women       440       470

Can you conclude that gender and having a hernia are related for all adults? Test using α = 0.05.
Solution

$H_0$: Gender and Hernia are independent
vs $H_1$: Gender and Hernia are not independent

The observed counts, with the expected counts in parentheses, are:

          Hernia         No Hernia      Total
Men       640 (588.6)    450 (501.4)    1090
Women     440 (491.4)    470 (418.6)     910
Total     1080           920            2000


The expected cell counts are

$$\begin{aligned}
E_{11} &= \frac{r_1 c_1}{n} = \frac{(1090)(1080)}{2000} = 588.6 \\
E_{12} &= \frac{r_1 c_2}{n} = \frac{(1090)(920)}{2000} = 501.4 \\
E_{21} &= \frac{r_2 c_1}{n} = \frac{(910)(1080)}{2000} = 491.4 \\
E_{22} &= \frac{r_2 c_2}{n} = \frac{(910)(920)}{2000} = 418.6
\end{aligned}$$
To compute the $X^2$ value:

$$\begin{aligned}
X^2_{calc} &= \sum_{i,j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}} \\
&= \frac{(O_{11} - E_{11})^2}{E_{11}} + \frac{(O_{12} - E_{12})^2}{E_{12}} + \frac{(O_{21} - E_{21})^2}{E_{21}} + \frac{(O_{22} - E_{22})^2}{E_{22}} \\
&= \frac{(640 - 588.6)^2}{588.6} + \frac{(450 - 501.4)^2}{501.4} + \frac{(440 - 491.4)^2}{491.4} + \frac{(470 - 418.6)^2}{418.6} \\
&= 4.489 + 5.269 + 5.376 + 6.311 = 21.445
\end{aligned}$$

Reject $H_0$ if $X^2_{calc} > X^2_{(r-1)(c-1),\alpha}$.

From the $X^2$ table: $X^2_{(r-1)(c-1),\alpha} = X^2_{(2-1)(2-1),0.05} = X^2_{1,0.05} = 3.841$.

Since $X^2_{calc} > X^2_{1,0.05}$, that is $21.445 > 3.841$, $H_0$ is rejected.

Gender and Hernia are not independent. They are related.
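For a 2 × 2 table the cell-by-cell sum can be cross-checked against the standard shortcut $X^2 = n(ad - bc)^2 / (r_1 r_2 c_1 c_2)$. The shortcut is not used in the slides; it is included here only as an independent check, and the variable names are illustrative.

```python
a, b = 640, 450          # men:   hernia, no hernia
c, d = 440, 470          # women: hernia, no hernia

r1, r2 = a + b, c + d    # row totals (1090, 910)
c1, c2 = a + c, b + d    # column totals (1080, 920)
n = r1 + r2              # 2000 patients

# Cell-by-cell computation, as in the worked solution
observed = [a, b, c, d]
expected = [r1 * c1 / n, r1 * c2 / n, r2 * c1 / n, r2 * c2 / n]
x2_cells = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Shortcut valid only for 2 x 2 tables (cross-check, not from the slides)
x2_shortcut = n * (a * d - b * c) ** 2 / (r1 * r2 * c1 * c2)

print(round(x2_cells, 3), round(x2_shortcut, 3))
```

Both routes agree, and the small difference from 21.445 in the hand calculation comes from rounding each term to three decimals before adding.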
Question of Direct Relevance to Exam

Example: A survey was conducted on a new Covid–19 vaccine. The vaccine was delivered in a three-shot sequence over a period of three weeks. Some people received all three shots, some appeared for only two shots, and others received only one shot. The following table gives the records of 1000 residents in a small community.

               One Shot   Two Shots   Three Shots
Covid–19          24          9           13
No Covid–19      289        100          565

Do the data present sufficient evidence that the vaccine was successful in reducing the number of Covid–19 cases in the small community? Test using α = 0.05.
Solution

$H_0$: Vaccine and Covid–19 are independent
vs $H_1$: Vaccine reduces Covid–19 cases

The observed counts, with the expected counts in parentheses, are:

               One Shot       Two Shots       Three Shots     Total
Covid–19       24 (14.4)      9 (5.01)        13 (26.59)        46
No Covid–19    289 (298.6)    100 (103.99)    565 (551.41)     954
Total          313            109             578             1000


The expected cell counts are

$$\begin{aligned}
E_{11} &= \frac{r_1 c_1}{n} = \frac{(46)(313)}{1000} = 14.4 \\
E_{12} &= \frac{r_1 c_2}{n} = \frac{(46)(109)}{1000} = 5.01 \\
E_{13} &= \frac{r_1 c_3}{n} = \frac{(46)(578)}{1000} = 26.59 \\
E_{21} &= \frac{r_2 c_1}{n} = \frac{(954)(313)}{1000} = 298.6 \\
E_{22} &= \frac{r_2 c_2}{n} = \frac{(954)(109)}{1000} = 103.99 \\
E_{23} &= \frac{r_2 c_3}{n} = \frac{(954)(578)}{1000} = 551.41
\end{aligned}$$
To compute the $X^2$ value:

$$\begin{aligned}
X^2_{calc} &= \sum_{i,j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}} \\
&= \frac{(O_{11} - E_{11})^2}{E_{11}} + \frac{(O_{12} - E_{12})^2}{E_{12}} + \frac{(O_{13} - E_{13})^2}{E_{13}} + \frac{(O_{21} - E_{21})^2}{E_{21}} + \frac{(O_{22} - E_{22})^2}{E_{22}} + \frac{(O_{23} - E_{23})^2}{E_{23}} \\
&= \frac{(24 - 14.4)^2}{14.4} + \frac{(9 - 5.01)^2}{5.01} + \frac{(13 - 26.59)^2}{26.59} + \frac{(289 - 298.6)^2}{298.6} + \frac{(100 - 103.99)^2}{103.99} + \frac{(565 - 551.41)^2}{551.41} \\
&= 17.31
\end{aligned}$$

Reject $H_0$ if $X^2_{calc} > X^2_{(r-1)(c-1),\alpha}$.

From the $X^2$ table: $X^2_{(r-1)(c-1),\alpha} = X^2_{(2-1)(3-1),0.05} = X^2_{2,0.05} = 5.99$.

Since $X^2_{calc} > X^2_{2,0.05}$, that is $17.31 > 5.99$, $H_0$ is rejected.

There is sufficient evidence to indicate a relationship between treatment and incidence of Covid–19.
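The chi-square test establishes that a relationship exists but not its direction, so it can help to look at the infection rate in each column as well. The sketch below recomputes the statistic with unrounded expected counts and adds those rates; the names are illustrative.

```python
shots = ["One Shot", "Two Shots", "Three Shots"]
table = {
    "Covid-19":    [24, 9, 13],
    "No Covid-19": [289, 100, 565],
}

col_totals = [sum(col) for col in zip(*table.values())]      # 313, 109, 578
row_totals = {status: sum(row) for status, row in table.items()}
n = sum(col_totals)                                          # 1000 residents

# X^2 over all six cells with E_ij = r_i * c_j / n
x2_calc = sum((o - row_totals[status] * c / n) ** 2 / (row_totals[status] * c / n)
              for status, row in table.items()
              for o, c in zip(row, col_totals))

# Proportion of residents with Covid-19 in each vaccination group
infection_rate = {s: cases / total
                  for s, cases, total in zip(shots, table["Covid-19"], col_totals)}

print(round(x2_calc, 2))
for s, rate in infection_rate.items():
    print(f"{s:12s} {rate:.3%}")
```

The rate in the three-shot group is the lowest of the three, which is the pattern the alternative hypothesis describes.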
