Categorical data analysis
Definition: Categorical data consists of variables with a
finite number of values – counts rather than
measurements.
Examples
Categorical data arise in a number of ways:
– Binary variables: yes/no, pass/fail, live/die
– Unordered multinomial: Christian, Muslim, Protestant, Catholic
– Ordinal:
  • No education, Elementary, Secondary, College
  • Social class: upper, middle, lower
– Imperfect scale measurement: e.g., Likert's 5-point scale
– Grouped variables: e.g., income in bands
The analysis of frequency tables
1. Chi-Square Test
• The Chi-Square (χ²) distribution is a probability distribution used to make statistical inferences about categorical data (proportions) when there are two or more categories.
• It is widely used in the analysis of contingency tables.
• The Chi-Square test allows us to test for association between two categorical variables:
Ho: No association between the variables.
HA: There is an association.
• Consequently, a significant p-value implies association.
• The Chi-Square test compares observed to expected counts (frequencies) under the assumption that Ho is true (no association).
                    Disease
                  Yes      No
Exposure  Yes      a        b       a+b = n1
          No       c        d       c+d = n2
                 a+c = m1  b+d = m2
The general case – the r × c table
Table 1: Source of water and birth weight, sub-sample from Jimma Infant Survival data
[Table: rows = water source; columns = birth weight (< 2,500; 2,500–2,999; 3,000–3,499; 3,500+) and Total – cell counts not recovered]
2 × 2 contingency table
Example: Consider the following sub-sample from the Jimma Infant Survival study, used to look into differences in the proportion of low birth weight babies between urban and rural residents.
SPSS output – Chi-squared test
Definition
• The Chi-Square test statistic measures the discrepancy between k observed frequencies O1, O2, …, Ok and the corresponding expected frequencies E1, E2, …, Ek:
  χ² = Σ (Oi − Ei)² / Ei
• When the Ho of no association is true, the observed and expected counts will be similar, their differences will be close to zero, resulting in a SMALL chi-square statistic value.
• When the HA of an association is true, the observed counts will be unlike the expected counts, their differences will be non-zero and their squared differences positive, resulting in a LARGE POSITIVE chi-square statistic value.
• The Chi-Square test is based on the table of χ² values for different degrees of freedom (df); the data are laid out in a contingency table (e.g., a 2×2 table).
• If the value of χ² is zero, there is no discrepancy between the observed and the expected frequencies.
• The greater the discrepancy, the larger the value of χ².
• The calculated value of χ² is compared with the tabulated value for the given df.
Degrees of Freedom
• Counts in the Chi-Square test of a 2×2 table are represented as “a”, “b”, “c” and “d”.
• The general calculation: df = (rows − 1) × (columns − 1); for a 2×2 table, df = 1.
Expected Value
• The expected count of a cell is the product of its row total multiplied by its column total, divided by the grand total:
  E = (row total × column total) / grand total
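As a minimal sketch, the rule above can be written as a one-line helper. The margins here (row totals 70 and 30, column totals 60 and 40, grand total 100) are purely illustrative:

```python
# Expected cell count under Ho (no association):
# E = (row total * column total) / grand total
def expected_count(row_total, col_total, grand_total):
    return row_total * col_total / grand_total

# Illustrative margins: row totals 70 / 30, column totals 60 / 40
print(expected_count(70, 60, 100))  # 42.0
print(expected_count(30, 40, 100))  # 12.0
```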
Contingency Table
• A table composed of rows cross-classified
by columns
• A 2x2 contingency table is a table
composed of two rows cross-classified by
two columns
• Appropriate to display data that can be
classified by two different variables, each
of which has only two possible outcomes
Cont…
The use of chi squared distribution for the test
statistic X2 is based on a ‘large sample’
approximation.
Chi-square requirements
• The sample must be randomly drawn from the population.
• The observations must be independent.
• Values/categories of the variables must be mutually exclusive.
• At least 80% of the cells should have expected counts/frequencies greater than 5.
• All cells should have expected frequencies greater than 1.
• Data must be reported as raw frequencies (not percentages).
• All the data in the sample must be used.
• Observed frequencies cannot be too small.
Types
• Goodness-of-fit test – compares the observed frequencies of one categorical variable with the frequencies expected under a hypothesized distribution.
• Test of independence – tests for association between two categorical variables.
Ex. Rolling a die 60 times
Face          1    2    3    4    5    6
obs           6    8   12   15   14    5
exp          10   10   10   10   10   10
(obs−exp)    −4   −2    2    5    4   −5
(obs−exp)²   16    4    4   25   16   25
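The die example above can be sketched in a few lines; χ² is just the sum of (O − E)²/E over the six faces, with df = 6 − 1 = 5:

```python
# Chi-square goodness-of-fit statistic for the die example
obs = [6, 8, 12, 15, 14, 5]
exp = [10] * 6   # fair die: 60 rolls / 6 faces
chi2 = sum((o - e) ** 2 / e for o, e in zip(obs, exp))
print(chi2)  # 9.0
```

With χ² = 9.0 on 5 df, the statistic falls below the tabulated χ²(0.05, 5) = 11.07, so the fair-die hypothesis is not rejected at the 5% level.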
Test of Independence
• Test of Independence – two categorical
variables are involved, and the observed
and expected frequencies are compared.
Here the expected frequencies are those
the researcher would expect if the two
variables were independent of each other.
Cont…
Observed:
        C1    C2
R1      A     B     A+B
R2      C     D     C+D
Expected (for each cell): E = row total × column total / n
Example
• Is the digital rectal exam result (DRE) independent of the biopsy result (BIOP)?
OBSERVED
         DRE+   DRE−
BIOP+     50     20     70
BIOP−     10     20     30
          60     40    100
Solution
Expected counts (row total × column total / grand total):
E1 = 70 × 60 / 100 = 42
E2 = 30 × 60 / 100 = 18
E3 = 70 × 40 / 100 = 28
E4 = 30 × 40 / 100 = 12
Cont…
O     E    (O−E)  (O−E)²  (O−E)²/E
50    42     8      64      1.52
10    18    −8      64      3.56
20    28    −8      64      2.29
20    12     8      64      5.33
                    χ² =   12.7
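The whole observed-to-expected comparison can be sketched as a small function; applied to the DRE × biopsy table it reproduces the hand calculation:

```python
def chi_square(table):
    """Chi-square statistic for an r x c table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            exp = row_totals[i] * col_totals[j] / n  # E = row x column / grand total
            stat += (obs - exp) ** 2 / exp
    return stat

stat = chi_square([[50, 20], [10, 20]])  # DRE x biopsy table
print(round(stat, 2))  # 12.7
```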
[Table: MI status over 3 years by OC-use group (Yes/No/Total) – cell counts not recovered]
MCNEMAR’S TEST FOR CORRELATED (DEPENDENT) PROPORTIONS
     A         B       (A + B)
     C         D       (C + D)
  (A + C)   (B + D)       n
MCNEMAR’S TEST FOR CORRELATED
(DEPENDENT) PROPORTIONS
Sample Problem
A randomly selected group of 120 students taking a
standardized test for entrance into college exhibits a
failure rate of 50%. A company which specializes in
coaching students on this type of test has indicated that it
can significantly reduce failure rates through a four-hour
seminar. The students are exposed to this coaching
session, and re-take the test a few weeks later. The
school board is wondering if the results justify paying this
firm to coach all of the students in the high school. Should
they? Test at the 5% level.
MCNEMAR’S TEST FOR CORRELATED (DEPENDENT) PROPORTIONS
Sample Problem
The summary data for this study appear as follows:
                 Before
               Pass   Fail   Total
After   Pass    56     56     112
        Fail     4      4       8
        Total   60     60     120
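McNemar's statistic depends only on the discordant cells (students whose result changed between the two tests). A minimal sketch for the coaching data, where 56 went fail → pass and 4 went pass → fail:

```python
def mcnemar_chi2(b, c):
    """McNemar chi-square (no continuity correction) from the two
    discordant cells b and c of a paired 2x2 table."""
    return (b - c) ** 2 / (b + c)

# Coaching example: 56 students went fail -> pass, 4 went pass -> fail
stat = mcnemar_chi2(56, 4)
print(round(stat, 2))  # 45.07
```

Since 45.07 far exceeds the critical value χ²(0.05, 1) = 3.84, the change in failure rate is significant at the 5% level.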
MCNEMAR’S TEST FOR CORRELATED (DEPENDENT) PROPORTIONS
Step II: α = 0.05
The test statistic is referred to the χ² distribution with 1 df; the critical value at α = 0.05 is 3.84.
Trend test in 2 x c tables
SPSS output – Chi squared test
[SPSS table: Chi-Square Tests – Value, df, Asymp. Sig. (2-sided); values not recovered]
Cont…
The total variation among the groups can be subdivided into:
– a trend in proportions across the groups, and
– the remainder.
[Figure: bar chart by educational status of mothers – Illiterate, Elementary, Junior, Senior HS & above; vertical axis 0–10]
For the above data on birth weight and education:
The standard X² = 19.6 with 3 degrees of freedom, p < 0.001; X²trend = 19.4 on 1 degree of freedom (remember the df for a linear regression line between two variables), p < 0.001.
Trend test in 2 x c (column) tables
                          Salt intake
                 Low        Regular       High       Total
                O    E      O    E       O    E
Hypertension   18  38.5    54  54.1     78  57.4      150
CONT…
Notation used (sums over i = 1 … k):
N = Σ ni,   R = Σ ri,   P = R / N,   x̄ = Σ ni xi / N
Score (xi): 1, 2, 3
Solution
x̄ = 2.13,  Σ ri xi = 360,  R = 150,  P = 0.326
X²trend = [Σ ri xi − R x̄]² / { P(1 − P) [Σ ni xi² − N x̄²] } = 5.39
           Yes    No
   Yes      a     b     a+b
   No       c     d     c+d
   Total   a+c   b+d     N

RR = [a/(a+b)] / [c/(c+d)]
                  Breast Cancer
1st Birth       Yes      No      Total
Exposed          31     1597      1628
Unexposed        65     4475      4540
Total            96     6072      6168

RR = [a/(a+b)] / [c/(c+d)]
a/(a+b) = 31/1628 = 0.019
c/(c+d) = 65/4540 = 0.014
RR = 0.019 / 0.014 ≈ 1.36, with 95% CI (0.89, 2.08)
• This interval contains the value 1, so the risk ratio is not significantly different from 1.
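A sketch of the calculation, using the standard log-scale standard error for a risk ratio. The slide's interval (0.89, 2.08) was computed from the rounded risks 0.019 and 0.014, so the unrounded interval below differs slightly:

```python
import math

a, n1 = 31, 1628    # exposed: cases / total
c, n2 = 65, 4540    # unexposed: cases / total

rr = (a / n1) / (c / n2)
# log-scale standard error for the risk ratio
se = math.sqrt(1 / a - 1 / n1 + 1 / c - 1 / n2)
lo = math.exp(math.log(rr) - 1.96 * se)
hi = math.exp(math.log(rr) + 1.96 * se)
print(round(rr, 2))  # 1.33
```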
3. The Odds Ratio
• The odds ratio (OR) is the odds in favor of disease for the exposed group divided by the odds in favor of disease for the unexposed group.
• The odds in favor of disease = p/(1 − p), where p = probability of a disease.
• Odds = Pr(event occurs) / Pr(event does not occur) = p/(1 − p)
• The odds ratio is defined as:
  OR = [p1/(1 − p1)] / [p2/(1 − p2)]
• It is estimated from a 2×2 table by:
  OR̂ = ad / (bc)
Example:
• In a study of the risk factors for invasive cervical cancer, the following data were collected (case-control):
• A 95% confidence interval for the odds ratio is:
  ( exp[ln OR̂ − Z √(1/a + 1/b + 1/c + 1/d)],  exp[ln OR̂ + Z √(1/a + 1/b + 1/c + 1/d)] )
• For the cervical cancer data, this gives (1.10, 2.13).
• This interval does not contain the value 1
• We conclude that the odds of developing
cervical cancer are significantly higher for
smokers than for nonsmokers
Example: Odds of Death
Related to Vit A use (Case-Control Study)
• What is the estimated OR?
• Estimated OR = (46/61)/(74/59)=0.60
• 95% CI = (0.36, 1.04)
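The Vit A example can be verified with the same log-scale (Woolf) interval used above; tiny differences from the slide's (0.36, 1.04) come from rounding the point estimate before exponentiating:

```python
import math

a, b, c, d = 46, 61, 74, 59   # cells from the Vit A case-control table

odds_ratio = (a / b) / (c / d)            # = ad / (bc)
se = math.sqrt(1/a + 1/b + 1/c + 1/d)     # Woolf's log-scale standard error
lo = math.exp(math.log(odds_ratio) - 1.96 * se)
hi = math.exp(math.log(odds_ratio) + 1.96 * se)
print(round(odds_ratio, 2))  # 0.6
```

Because the interval includes 1, the reduction in the odds of death with Vit A use is not statistically significant at the 5% level.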
2. Quantitative Data
• Previously we focused on measures of the
strength of association between two
dichotomous random variables
[Scatter plots illustrating relationships between quantitative variables: two plots with weight, wt (kg), on an axis from 60 to 120, and one plot of Height in CM against Age in Weeks]
[Figure: Negative relationship – Reliability vs. Age of Car]
[Figure: No relation]
r = Σ(Xi − X̄)(Yi − Ȳ) / √[ Σ(Xi − X̄)² · Σ(Yi − Ȳ)² ]
  = [ΣXY − (ΣX)(ΣY)/n] / √{ [ΣX² − (ΣX)²/n] [ΣY² − (ΣY)²/n] }
Serial No   Age (years)   Weight (Kg)
1               7             12
2               6              8
3               8             12
4               5             10
5               6             11
6               9             13
These two variables are of the quantitative type: one variable (age) is called the independent variable and denoted X, and the other (weight) is called the dependent variable and denoted Y. To find the relation between age and weight, compute the simple correlation coefficient using the following formula:
r = [Σxy − (Σx)(Σy)/n] / √{ [Σx² − (Σx)²/n] [Σy² − (Σy)²/n] }
r = 0.759
strong direct correlation
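The computational formula above can be checked directly on the age/weight data:

```python
import math

x = [7, 6, 8, 5, 6, 9]        # age (years)
y = [12, 8, 12, 10, 11, 13]   # weight (kg)
n = len(x)

sxy = sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y) / n
sxx = sum(a * a for a in x) - sum(x) ** 2 / n
syy = sum(b * b for b in y) - sum(y) ** 2 / n
r = sxy / math.sqrt(sxx * syy)
print(round(r, 2))  # 0.76
```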
X      Y      X²     Y²     XY
10     2     100      4     20
 8     3      64      9     24
 2     9       4     81     18
 1     7       1     49      7
 5     6      25     36     30
 6     5      36     25     30
ΣX = 32   ΣY = 32   ΣX² = 230   ΣY² = 204   ΣXY = 129
Calculating Correlation Coefficient
r = - 0.94
rs = 1 − 6Σ(di)² / [n(n² − 1)]
Sample     Level of injury (X)    Income (Y)
A          moderate                  25
B          mild                      10
C          fatal                      8
D          severe                    10
E          severe                    15
F          normal                    50
G          fatal                     60
Answer:
Sample   Level of injury   Income   Rank X   Rank Y    di      di²
A        moderate            25       5       3         2       4
B        mild                10       6       5.5       0.5     0.25
C        fatal                8       1.5     7        −5.5    30.25
D        severe              10       3.5     5.5      −2       4
E        severe              15       3.5     4        −0.5     0.25
F        normal              50       7       2         5      25
G        fatal               60       1.5     1         0.5     0.25
                                                   Σdi² = 64
rs = 1 − (6 × 64) / (7 × 48) = −0.14
Comment:
There is an indirect weak correlation
between level of injury and income.
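A sketch of the d² formula applied to the midranks from the worked table. Note that with tied ranks this formula is only approximate; a rank-based Pearson calculation (as in scipy's `spearmanr`) would give a slightly different value:

```python
# Midranks taken from the worked table above (X = injury severity, Y = income)
rank_x = [5, 6, 1.5, 3.5, 3.5, 7, 1.5]
rank_y = [3, 5.5, 7, 5.5, 4, 2, 1]
n = len(rank_x)

d2 = sum((rx - ry) ** 2 for rx, ry in zip(rank_x, rank_y))
rs = 1 - 6 * d2 / (n * (n ** 2 - 1))
print(d2, round(rs, 2))  # 64.0 -0.14
```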
Linear Regression - Model
Population:   Yi = α + βXi + εi
Sample:       yi = b0 + b1xi + ei, with fitted line ŷ = b0 + b1xi
y = α + βx + ε   or   μy|x = α + βx
y = systematic (nonrandom) component + random error component
• Where
• y is the dependent (response) variable, the variable we wish to explain or predict;
• x is the independent (explanatory) variable, also called the predictor variable; and
• ε is the error term, the only random component in the model, and thus, the only source of randomness in y.
Cont…
• μy|x is the mean of y when x is specified, also called the conditional mean of Y.
[Figure: the population regression line μy|x = α + βx, with the dependent variable Y plotted against the independent (predictor) variable X; β = slope, α = intercept]
Actual observed values of Y (y) differ from the expected value (μy|x) by an unexplained or random error (ε):
y = μy|x + ε = α + βx + ε
Assumptions of the Simple Linear Regression Model
(LINE assumptions of the Simple Linear Regression Model)
• The relationship between X and Y is a straight-line (linear) relationship.
• The values of the independent variable X are assumed fixed (not random); the only randomness in the values of Y comes from the error term ε.
• The errors are uncorrelated (i.e. independent) in successive observations.
• The errors are Normally distributed with mean 0 and variance σ² (equal variance). That is: ε ~ N(0, σ²)
[Figure: identical normal distributions of errors, N(μy|x, σ²), all centered on the regression line μy|x = α + βx]
Fitting a Regression Line
[Figure: data points with the least squares regression line; three errors from the line are shown]
ŷ = a + bx : the fitted regression line
ŷi : the predicted value of Y for xi
Error: ei = yi − ŷi
Sums of Squares, Cross Products, and Least Squares Estimators
Sums of squares and cross products:
lxx = Σ(x − x̄)² = Σx² − (Σx)²/n
lyy = Σ(y − ȳ)² = Σy² − (Σy)²/n
lxy = Σ(x − x̄)(y − ȳ) = Σxy − (Σx)(Σy)/n
Least squares regression estimators:
b = lxy / lxx
a = ȳ − b x̄
ŷ = a + bx
Example
Patient     x       y        x²         y²         x×y
1         22.4   134.0     501.76    17956.0     3001.60
4         25.1    80.2     630.01     6432.0     2013.02
8         32.4    97.2    1049.76     9447.8     3149.28
2         51.6   167.0    2662.56    27889.0     8617.20
3         58.1   132.3    3375.61    17503.3     7686.63
5         65.9   100.0    4342.81    10000.0     6590.00
7         75.3   187.2    5670.09    35043.8    14096.16
6         79.7   139.1    6352.09    19348.8    11086.27
10        85.7   199.4    7344.49    39760.4    17088.58
9         96.4   192.3    9292.96    36979.3    18537.72
Total    592.6  1428.7   41222.14   220360.5    91866.46

lxx = Σx² − (Σx)²/n = 41222.14 − 592.6²/10 = 6104.66
lyy = Σy² − (Σy)²/n = 220360.47 − 1428.70²/10 = 16242.10
lxy = Σxy − (Σx)(Σy)/n = 91866.46 − (592.6)(1428.70)/10 = 7201.70
b = lxy / lxx = 7201.70 / 6104.66 = 1.18
a = ȳ − b x̄ = 1428.7/10 − (1.18)(592.6/10) = 72.96
Regression equation: ŷ = 72.96 + 1.18x
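The least squares estimators above can be verified directly from the raw data:

```python
x = [22.4, 25.1, 32.4, 51.6, 58.1, 65.9, 75.3, 79.7, 85.7, 96.4]
y = [134.0, 80.2, 97.2, 167.0, 132.3, 100.0, 187.2, 139.1, 199.4, 192.3]
n = len(x)

lxx = sum(v * v for v in x) - sum(x) ** 2 / n
lyy = sum(v * v for v in y) - sum(y) ** 2 / n
lxy = sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y) / n

b = lxy / lxx                      # slope
a = sum(y) / n - b * sum(x) / n    # intercept: a = ybar - b * xbar
print(round(b, 2), round(a, 2))  # 1.18 72.96
```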
[Figure: partition of variation about the mean Ȳ – SSR (due to regression) = Σ(Ŷi − Ȳ)²; SST (total); the remainder is random/unexplained]
. regress weight age
[Scatter plot: vertical axis 0–175, horizontal axis Percentage immunized, 0–125]
Procedure:
rs = 1 − 6Σ(di)² / [n(n² − 1)]
B. Linear Regression
• A regression is a description of a response measure, Y, the dependent variable, as a function of an explanatory variable, X, the independent variable.
• The goal is the prediction or estimation of the value of one variable, Y, based on the value of the other variable, X.
• Simple: one predictor variable (X) used to predict the response (Y).
• Multiple: y is the outcome variable and x1, x2, …, xk are the values of k distinct explanatory variables.
. regress weight age order cid mid
F-distribution
• Used for comparing the variances of two populations.
• It is defined in terms of the ratio of the variances of two normally distributed populations, so it is sometimes also called the variance ratio.
• F-statistic, if the assumptions are met:
  F = (s1²/σ1²) / (s2²/σ2²)
  s1² = Σ(x1 − x̄1)² / (n1 − 1)
  s2² = Σ(x2 − x̄2)² / (n2 − 1)
Cont…
• Degrees of freedom: v1=n1 – 1, v2 = n2 – 1
• For different values of v1 and v2 we will get
different distributions, so v1 and v2 are
parameters of F distribution.
• If σ12 = σ22, then, the statistic F = s12/ s22
follows F distribution with n1 – 1 and n2 –
1 degrees of freedom.
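A minimal sketch of the variance-ratio statistic; the sample variances and sample sizes here are hypothetical, assumed only for illustration:

```python
# Hypothetical sample variances and sample sizes (assumed values)
s1_sq, n1 = 12.5, 16
s2_sq, n2 = 8.0, 21

F = s1_sq / s2_sq        # variance ratio under Ho: sigma1^2 == sigma2^2
df1, df2 = n1 - 1, n2 - 1
print(F, df1, df2)  # 1.5625 15 20
```

The observed F would then be compared with the tabulated F value for (df1, df2) degrees of freedom.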
Analysis of Variance
Types of Experimental Designs
Experimental Designs: One-Way ANOVA, Two-Way ANOVA
Completely Randomized Design
1. Experimental Units (Subjects) Are
Assigned Randomly to Treatments
– Subjects are Assumed Homogeneous
2. Variables
– One categorical Independent Variable
– One Continuous Dependent Variable
Assumptions
Hypotheses
H0: μ1 = μ2 = μ3 = ... = μp
– All Population Means are Equal
– No Treatment Effect
Ha: Not All μj Are Equal
– At Least 1 Pop. Mean is Different (μi ≠ μj for some i, j)
– Treatment Effect
[Figure: under H0 the group distributions f(X) coincide (μ1 = μ2 = μ3); under Ha at least one is shifted]
One-Way ANOVA Basic Idea
1. Compares 2 Types of Variation to Test
Equality of Means
2. If Treatment Variation Is Significantly
Greater Than Random Variation then
Means Are Not Equal
3. Variation Measures Are Obtained by ‘Partitioning’ Total Variation
One-Way ANOVA Partitions Total Variation
Total variation = Variation due to treatment + Variation due to random sampling (error)
Total Variation
SS(Total) = (x11 − x̄)² + (x21 − x̄)² + … + (xij − x̄)²
where x̄ is the grand mean of all responses.
[Figure: responses plotted for groups 1–3 with group means x̄1, x̄2, x̄3 and the grand mean x̄]
• 1. Test Statistic
– F = MST / MSE = V.R.
• MST is the Mean Square for Treatment: MST = SST / (p − 1)
• MSE is the Mean Square for Error: MSE = SSE / (n − p)
• 2. Degrees of Freedom
ν1 = p − 1
ν2 = n − p
• p = # of treatment groups, or levels
• n = total sample size
One-Way ANOVA Summary Table
Source of     Degrees of   Sum of      Mean Square          F
Variation     Freedom      Squares     (Variance)
Treatment     p − 1        SST         MST = SST/(p − 1)    MST/MSE
Error         n − p        SSE         MSE = SSE/(n − p)
Total         n − 1        SS(Total)
                           = SST + SSE
One-Way ANOVA F-Test Critical Value
If means are equal, F = MST / MSE ≈ 1. Only reject for large F!
[Figure: F distribution – do not reject H0 for F below Fα(p−1, n−p); reject H0 above it. Always one-tail!]
HOW TO CALCULATE ANOVA’S BY HAND…
Treatment 1   Treatment 2   Treatment 3   Treatment 4
y11           y21           y31           y41          n = 10 obs./group
y12           y22           y32           y42
y13           y23           y33           y43          k = 4 groups
y14           y24           y34           y44
y15           y25           y35           y45
y16           y26           y36           y46
y17           y27           y37           y47
y18           y28           y38           y48
y19           y29           y39           y49
y110          y210          y310          y410

The (within) group variances:
Σj(y1j − ȳ1)²/(10 − 1),  Σj(y2j − ȳ2)²/(10 − 1),  Σj(y3j − ȳ3)²/(10 − 1),  Σj(y4j − ȳ4)²/(10 − 1)
Sum of Squares Within (SSW), or Sum of Squares Error (SSE)
The (within) group variances:
Σj(y1j − ȳ1)²/(10 − 1),  Σj(y2j − ȳ2)²/(10 − 1),  Σj(y3j − ȳ3)²/(10 − 1),  Σj(y4j − ȳ4)²/(10 − 1)
Summing just the numerators:
Σj(y1j − ȳ1)² + Σj(y2j − ȳ2)² + Σj(y3j − ȳ3)² + Σj(y4j − ȳ4)²
= Σi Σj (yij − ȳi)²   (i = 1 … 4, j = 1 … 10)
= Sum of Squares Within (SSW) (or SSE, for chance error)
Sum of Squares Between (SSB), or Sum of Squares Regression (SSR)
Overall mean of all 40 observations (“grand mean”):
ȳ = Σi Σj yij / 40
SSB = 10 × Σi (ȳi − ȳ)²
Sum of Squares Between (SSB): variability of the group means compared to the grand mean (the variability due to the treatment).
Total Sum of Squares (SST)
Squared difference of every observation from the grand mean:
SST = Σi Σj (yij − ȳ)²
Partitioning of Variance
Σi Σj (yij − ȳi)² + 10 × Σi (ȳi − ȳ)² = Σi Σj (yij − ȳ)²
SSW + SSB = SST
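The partitioning identity can be checked numerically; the three groups of four observations below are hypothetical, used only to illustrate that the identity holds:

```python
def anova_partition(groups):
    """Return (SSW, SSB, SST) for a list of groups of observations."""
    all_vals = [v for g in groups for v in g]
    grand = sum(all_vals) / len(all_vals)              # grand mean
    ssw = sum(sum((v - sum(g) / len(g)) ** 2 for v in g) for g in groups)
    ssb = sum(len(g) * (sum(g) / len(g) - grand) ** 2 for g in groups)
    sst = sum((v - grand) ** 2 for v in all_vals)
    return ssw, ssb, sst

# Hypothetical data: 3 groups of 4 observations
ssw, ssb, sst = anova_partition([[60, 67, 42, 67], [50, 52, 43, 67], [48, 49, 50, 55]])
print(abs(ssw + ssb - sst) < 1e-6)  # True: SSW + SSB = SST
```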
ANOVA Table
Source of variation   d.f.   Sum of squares   Mean Sum of Squares   F-statistic   p-value
Example
Data (4 treatment groups, 10 observations per group; partial rows shown):
67  52  49  67
42  43  50  54
67  67  55  67
56  67  56  68
62  59  61  65
SSW computation (partial), group 1 (mean 62):
(60−62)² + (67−62)² + (42−62)² + (67−62)² + (56−62)² + (62−62)² + (64−62)² + (59−62)² + (72−62)² + (71−62)²
plus group 2 (mean 59.7): (50−59.7)² + (52−59.7)² + (43−…
Fill in the ANOVA table
Total   39   2257.1
INTERPRETATION of ANOVA:
How much of the variance in height is explained by treatment group?
R² = “Coefficient of Determination” = SSB/TSS = 196.5/2257.1 ≈ 9%
Coefficient of Determination
R² = SSB / (SSB + SSE) = SSB / SST
The amount of variation in the outcome variable (dependent variable) that is explained by the predictor (independent variable).