Categorical data analysis
Definition: Categorical data consists of variables with a
finite number of values – counts rather than
measurements.
Examples
Categorical data arise in a number of ways:
– Binary variables: yes/no, pass/fail, live/die
– Unordered multinomial: Christian, Muslim, Protestant, Catholic
– Ordinal:
  • No education, Elementary, Secondary, College
  • Social class: upper, middle, lower
– Imperfect scale measurement: e.g., Likert's 5-point scale
– Grouped variables: e.g., income in bands
The analysis of frequency tables
1. Chi-Square Test
• The Chi-Square (χ²) distribution is a probability distribution used to make statistical inferences about categorical data (proportions) when there are two or more categories.
• It is widely used in the analysis of contingency tables.
• The Chi-Square test allows us to test for association between two categorical variables:
Ho: No association between the variables.
HA: There is an association.
• Consequently, a significant p-value implies association.
• The Chi-Square test compares observed to expected counts (frequencies) under the assumption that Ho is true (no association).
                    Disease
                  Yes      No
Exposure  Yes      a        b       a+b = n1
          No       c        d       c+d = n2
                 a+c = m1  b+d = m2
The general case – the r × c table
Table 1: Source of water and birth weight, sub-sample from Jimma Infant Survival data
[Table: rows = water source; columns = birth weight (< 2,500; 2,500–2,999; 3,000–3,499; 3,500+) and Total – cell counts not recovered]
2 × 2 contingency table
Example: Consider the following sub-sample from the Jimma Infant Survival study, used to look into differences in the proportion of low birth weight babies between urban and rural residents.
SPSS output – Chi-squared test
Definition
• The Chi-Square test statistic measures the discrepancy between k observed frequencies O1, O2, …, Ok and the corresponding expected frequencies E1, E2, …, Ek:
  χ² = Σ (Oi − Ei)² / Ei
• When the Ho of no association is true, the observed and expected counts will be similar, their differences will be close to zero, resulting in a SMALL chi-square statistic value.
• When the HA of an association is true, the observed counts will be unlike the expected counts, their differences will be non-zero and their squared differences positive, resulting in a LARGE POSITIVE chi-square statistic value.
• The Chi-Square test is based on the table of χ² values for different degrees of freedom (df); the data are laid out in a contingency table (e.g., a 2×2 table).
• If the value of χ² is zero, there is no discrepancy between the observed and the expected frequencies.
• The greater the discrepancy, the larger the value of χ².
• The calculated value of χ² is compared with the tabulated value for the given df.
Degrees of Freedom
• Counts in the Chi-Square test of a 2×2 table are represented as “a”, “b”, “c” and “d”.
• The general calculation: df = (rows − 1) × (columns − 1); for a 2×2 table, df = 1.
Expected Value
• The expected count of a cell is the product of its row total multiplied by its column total, divided by the grand total:
  E = (row total × column total) / grand total
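As a minimal sketch, the rule above can be written as a one-line helper. The margins here (row totals 70 and 30, column totals 60 and 40, grand total 100) are purely illustrative:

```python
# Expected cell count under Ho (no association):
# E = (row total * column total) / grand total
def expected_count(row_total, col_total, grand_total):
    return row_total * col_total / grand_total

# Illustrative margins: row totals 70 / 30, column totals 60 / 40
print(expected_count(70, 60, 100))  # 42.0
print(expected_count(30, 40, 100))  # 12.0
```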
Contingency Table
• A table composed of rows cross-classified
by columns
• A 2x2 contingency table is a table
composed of two rows cross-classified by
two columns
• Appropriate to display data that can be
classified by two different variables, each
of which has only two possible outcomes
Cont…
The use of chi squared distribution for the test
statistic X2 is based on a ‘large sample’
approximation.
Chi-square requirements
• The sample must be randomly drawn from the population.
• The observations must be independent.
• Values/categories of the variables must be mutually exclusive.
• At least 80% of the cells should have expected counts/frequencies greater than 5.
• All cells should have expected frequencies greater than 1.
• Data must be reported as raw frequencies (not percentages).
• All the data in the sample must be used.
• Observed frequencies cannot be too small.
Types
• Goodness-of-fit test – compares the observed frequencies of one categorical variable with the frequencies expected under a hypothesized distribution.
• Test of independence – tests for association between two categorical variables.
Ex. Rolling a die 60 times
Face          1    2    3    4    5    6
obs           6    8   12   15   14    5
exp          10   10   10   10   10   10
(obs−exp)    −4   −2    2    5    4   −5
(obs−exp)²   16    4    4   25   16   25
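The die example above can be sketched in a few lines; χ² is just the sum of (O − E)²/E over the six faces, with df = 6 − 1 = 5:

```python
# Chi-square goodness-of-fit statistic for the die example
obs = [6, 8, 12, 15, 14, 5]
exp = [10] * 6   # fair die: 60 rolls / 6 faces
chi2 = sum((o - e) ** 2 / e for o, e in zip(obs, exp))
print(chi2)  # 9.0
```

With χ² = 9.0 on 5 df, the statistic falls below the tabulated χ²(0.05, 5) = 11.07, so the fair-die hypothesis is not rejected at the 5% level.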
Test of Independence
• Test of Independence – two categorical
variables are involved, and the observed
and expected frequencies are compared.
Here the expected frequencies are those
the researcher would expect if the two
variables were independent of each other.
Cont…
Observed:
        C1    C2
R1      A     B     A+B
R2      C     D     C+D
Expected (for each cell): E = row total × column total / n
Example
• Is the digital rectal exam result (DRE) independent of the biopsy result (BIOP)?
OBSERVED
         DRE+   DRE−
BIOP+     50     20     70
BIOP−     10     20     30
          60     40    100
Solution
Expected counts (row total × column total / grand total):
E1 = 70 × 60 / 100 = 42
E2 = 30 × 60 / 100 = 18
E3 = 70 × 40 / 100 = 28
E4 = 30 × 40 / 100 = 12
Cont…
O     E    (O−E)  (O−E)²  (O−E)²/E
50    42     8      64      1.52
10    18    −8      64      3.56
20    28    −8      64      2.29
20    12     8      64      5.33
                    χ² =   12.7
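The whole observed-to-expected comparison can be sketched as a small function; applied to the DRE × biopsy table it reproduces the hand calculation:

```python
def chi_square(table):
    """Chi-square statistic for an r x c table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            exp = row_totals[i] * col_totals[j] / n  # E = row x column / grand total
            stat += (obs - exp) ** 2 / exp
    return stat

stat = chi_square([[50, 20], [10, 20]])  # DRE x biopsy table
print(round(stat, 2))  # 12.7
```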
[Table: MI status over 3 years by OC-use group (Yes/No/Total) – cell counts not recovered]
MCNEMAR’S TEST FOR CORRELATED (DEPENDENT) PROPORTIONS
     A         B       (A + B)
     C         D       (C + D)
  (A + C)   (B + D)       n
MCNEMAR’S TEST FOR CORRELATED
(DEPENDENT) PROPORTIONS
Sample Problem
A randomly selected group of 120 students taking a
standardized test for entrance into college exhibits a
failure rate of 50%. A company which specializes in
coaching students on this type of test has indicated that it
can significantly reduce failure rates through a four-hour
seminar. The students are exposed to this coaching
session, and re-take the test a few weeks later. The
school board is wondering if the results justify paying this
firm to coach all of the students in the high school. Should
they? Test at the 5% level.
MCNEMAR’S TEST FOR CORRELATED (DEPENDENT) PROPORTIONS
Sample Problem
The summary data for this study appear as follows:
                 Before
               Pass   Fail   Total
After   Pass    56     56     112
        Fail     4      4       8
        Total   60     60     120
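McNemar's statistic depends only on the discordant cells (students whose result changed between the two tests). A minimal sketch for the coaching data, where 56 went fail → pass and 4 went pass → fail:

```python
def mcnemar_chi2(b, c):
    """McNemar chi-square (no continuity correction) from the two
    discordant cells b and c of a paired 2x2 table."""
    return (b - c) ** 2 / (b + c)

# Coaching example: 56 students went fail -> pass, 4 went pass -> fail
stat = mcnemar_chi2(56, 4)
print(round(stat, 2))  # 45.07
```

Since 45.07 far exceeds the critical value χ²(0.05, 1) = 3.84, the change in failure rate is significant at the 5% level.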
MCNEMAR’S TEST FOR CORRELATED (DEPENDENT) PROPORTIONS
Step II: α = 0.05
The test statistic is referred to the χ² distribution with 1 df; the critical value at α = 0.05 is 3.84.
Trend test in 2 x c tables
SPSS output – Chi squared test
[SPSS table: Chi-Square Tests – Value, df, Asymp. Sig. (2-sided); values not recovered]
Cont…
The total variation among the groups can be subdivided into:
– a trend in proportions across the groups, and
– the remainder.
[Figure: bar chart by educational status of mothers – Illiterate, Elementary, Junior, Senior HS & above; vertical axis 0–10]
For the above data on birth weight and education:
The standard X² = 19.6 with 3 degrees of freedom, p < 0.001; X²trend = 19.4 on 1 degree of freedom (remember the df for a linear regression line between two variables), p < 0.001.
Trend test in 2 x c (column) tables
                          Salt intake
                 Low        Regular       High       Total
                O    E      O    E       O    E
Hypertension   18  38.5    54  54.1     78  57.4      150
CONT…
Notation used (sums over i = 1 … k):
N = Σ ni,   R = Σ ri,   P = R / N,   x̄ = Σ ni xi / N
Score (xi): 1, 2, 3
Solution
x̄ = 2.13,  Σ ri xi = 360,  R = 150,  P = 0.326
X²trend = [Σ ri xi − R x̄]² / { P(1 − P) [Σ ni xi² − N x̄²] } = 5.39
           Yes    No
   Yes      a     b     a+b
   No       c     d     c+d
   Total   a+c   b+d     N

RR = [a/(a+b)] / [c/(c+d)]
                  Breast Cancer
1st Birth       Yes      No      Total
Exposed          31     1597      1628
Unexposed        65     4475      4540
Total            96     6072      6168

RR = [a/(a+b)] / [c/(c+d)]
a/(a+b) = 31/1628 = 0.019
c/(c+d) = 65/4540 = 0.014
RR = 0.019 / 0.014 ≈ 1.36, with 95% CI (0.89, 2.08)
• This interval contains the value 1, so the risk ratio is not significantly different from 1.
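A sketch of the calculation, using the standard log-scale standard error for a risk ratio. The slide's interval (0.89, 2.08) was computed from the rounded risks 0.019 and 0.014, so the unrounded interval below differs slightly:

```python
import math

a, n1 = 31, 1628    # exposed: cases / total
c, n2 = 65, 4540    # unexposed: cases / total

rr = (a / n1) / (c / n2)
# log-scale standard error for the risk ratio
se = math.sqrt(1 / a - 1 / n1 + 1 / c - 1 / n2)
lo = math.exp(math.log(rr) - 1.96 * se)
hi = math.exp(math.log(rr) + 1.96 * se)
print(round(rr, 2))  # 1.33
```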
3. The Odds Ratio
• The odds ratio (OR) is the odds in favor of disease for the exposed group divided by the odds in favor of disease for the unexposed group.
• The odds in favor of disease = p/(1 − p), where p = probability of a disease.
• Odds = Pr(event occurs) / Pr(event does not occur) = p/(1 − p)
• The odds ratio is defined as:
  OR = [p1/(1 − p1)] / [p2/(1 − p2)]
• It is estimated from a 2×2 table by:
  OR̂ = ad / (bc)
Example:
• In a study of the risk factors for invasive cervical cancer, the following data were collected (case-control):
• A 95% confidence interval for the odds ratio is:
  ( exp[ln OR̂ − Z √(1/a + 1/b + 1/c + 1/d)],  exp[ln OR̂ + Z √(1/a + 1/b + 1/c + 1/d)] )
• For the cervical cancer data, this gives (1.10, 2.13).
• This interval does not contain the value 1
• We conclude that the odds of developing
cervical cancer are significantly higher for
smokers than for nonsmokers
Example: Odds of Death
Related to Vit A use (Case-Control Study)
• What is the estimated OR?
• Estimated OR = (46/61)/(74/59)=0.60
• 95% CI = (0.36, 1.04)
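The Vit A example can be verified with the same log-scale (Woolf) interval used above; tiny differences from the slide's (0.36, 1.04) come from rounding the point estimate before exponentiating:

```python
import math

a, b, c, d = 46, 61, 74, 59   # cells from the Vit A case-control table

odds_ratio = (a / b) / (c / d)            # = ad / (bc)
se = math.sqrt(1/a + 1/b + 1/c + 1/d)     # Woolf's log-scale standard error
lo = math.exp(math.log(odds_ratio) - 1.96 * se)
hi = math.exp(math.log(odds_ratio) + 1.96 * se)
print(round(odds_ratio, 2))  # 0.6
```

Because the interval includes 1, the reduction in the odds of death with Vit A use is not statistically significant at the 5% level.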
2. Quantitative Data
• Previously we focused on measures of the
strength of association between two
dichotomous random variables
[Scatter plots illustrating relationships between quantitative variables: two plots with weight, wt (kg), on an axis from 60 to 120, and one plot of Height in CM against Age in Weeks]
[Figure: Negative relationship – Reliability vs. Age of Car]
[Figure: No relation]
r = Σ(Xi − X̄)(Yi − Ȳ) / √[ Σ(Xi − X̄)² · Σ(Yi − Ȳ)² ]
  = [ΣXY − (ΣX)(ΣY)/n] / √{ [ΣX² − (ΣX)²/n] [ΣY² − (ΣY)²/n] }
Serial No   Age (years)   Weight (Kg)
1               7             12
2               6              8
3               8             12
4               5             10
5               6             11
6               9             13
These two variables are of the quantitative type: one variable (age) is called the independent variable and denoted X, and the other (weight) is called the dependent variable and denoted Y. To find the relation between age and weight, compute the simple correlation coefficient using the following formula:
r = [Σxy − (Σx)(Σy)/n] / √{ [Σx² − (Σx)²/n] [Σy² − (Σy)²/n] }
r = 0.759
strong direct correlation
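The computational formula above can be checked directly on the age/weight data:

```python
import math

x = [7, 6, 8, 5, 6, 9]        # age (years)
y = [12, 8, 12, 10, 11, 13]   # weight (kg)
n = len(x)

sxy = sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y) / n
sxx = sum(a * a for a in x) - sum(x) ** 2 / n
syy = sum(b * b for b in y) - sum(y) ** 2 / n
r = sxy / math.sqrt(sxx * syy)
print(round(r, 2))  # 0.76
```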
X      Y      X²     Y²     XY
10     2     100      4     20
 8     3      64      9     24
 2     9       4     81     18
 1     7       1     49      7
 5     6      25     36     30
 6     5      36     25     30
ΣX = 32   ΣY = 32   ΣX² = 230   ΣY² = 204   ΣXY = 129
Calculating Correlation Coefficient
r = - 0.94
rs = 1 − 6Σ(di)² / [n(n² − 1)]
Sample     Level of injury (X)    Income (Y)
A          moderate                  25
B          mild                      10
C          fatal                      8
D          severe                    10
E          severe                    15
F          normal                    50
G          fatal                     60
Answer:
Sample   Level of injury   Income   Rank X   Rank Y    di      di²
A        moderate            25       5       3         2       4
B        mild                10       6       5.5       0.5     0.25
C        fatal                8       1.5     7        −5.5    30.25
D        severe              10       3.5     5.5      −2       4
E        severe              15       3.5     4        −0.5     0.25
F        normal              50       7       2         5      25
G        fatal               60       1.5     1         0.5     0.25
                                                   Σdi² = 64
rs = 1 − (6 × 64) / (7 × 48) = −0.14
Comment:
There is an indirect weak correlation
between level of injury and income.
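A sketch of the d² formula applied to the midranks from the worked table. Note that with tied ranks this formula is only approximate; a rank-based Pearson calculation (as in scipy's `spearmanr`) would give a slightly different value:

```python
# Midranks taken from the worked table above (X = injury severity, Y = income)
rank_x = [5, 6, 1.5, 3.5, 3.5, 7, 1.5]
rank_y = [3, 5.5, 7, 5.5, 4, 2, 1]
n = len(rank_x)

d2 = sum((rx - ry) ** 2 for rx, ry in zip(rank_x, rank_y))
rs = 1 - 6 * d2 / (n * (n ** 2 - 1))
print(d2, round(rs, 2))  # 64.0 -0.14
```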
Linear Regression - Model
Population:   Yi = α + βXi + εi
Sample:       yi = b0 + b1xi + ei, with fitted line ŷ = b0 + b1xi
y = α + βx + ε   or   μy|x = α + βx
y = systematic (nonrandom) component + random error component
• Where
• y is the dependent (response) variable, the variable we wish to explain or predict;
• x is the independent (explanatory) variable, also called the predictor variable; and
• ε is the error term, the only random component in the model, and thus, the only source of randomness in y.
Cont…
• μy|x is the mean of y when x is specified, also called the conditional mean of Y.
[Figure: the population regression line μy|x = α + βx, with the dependent variable Y plotted against the independent (predictor) variable X; β = slope, α = intercept]
Actual observed values of Y (y) differ from the expected value (μy|x) by an unexplained or random error (ε):
y = μy|x + ε = α + βx + ε
Assumptions of the Simple Linear Regression Model
(LINE assumptions of the Simple Linear Regression Model)
• The relationship between X and Y is a straight-line (linear) relationship.
• The values of the independent variable X are assumed fixed (not random); the only randomness in the values of Y comes from the error term ε.
• The errors are uncorrelated (i.e. independent) in successive observations.
• The errors are Normally distributed with mean 0 and variance σ² (equal variance). That is: ε ~ N(0, σ²)
[Figure: identical normal distributions of errors, N(μy|x, σ²), all centered on the regression line μy|x = α + βx]
Fitting a Regression Line
[Figure: data points with the least squares regression line; three errors from the line are shown]
ŷ = a + bx : the fitted regression line
ŷi : the predicted value of Y for xi
Error: ei = yi − ŷi
Sums of Squares, Cross Products, and Least Squares Estimators
Sums of squares and cross products:
lxx = Σ(x − x̄)² = Σx² − (Σx)²/n
lyy = Σ(y − ȳ)² = Σy² − (Σy)²/n
lxy = Σ(x − x̄)(y − ȳ) = Σxy − (Σx)(Σy)/n
Least squares regression estimators:
b = lxy / lxx
a = ȳ − b x̄
ŷ = a + bx
Example
Patient     x       y        x²         y²         x×y
1         22.4   134.0     501.76    17956.0     3001.60
4         25.1    80.2     630.01     6432.0     2013.02
8         32.4    97.2    1049.76     9447.8     3149.28
2         51.6   167.0    2662.56    27889.0     8617.20
3         58.1   132.3    3375.61    17503.3     7686.63
5         65.9   100.0    4342.81    10000.0     6590.00
7         75.3   187.2    5670.09    35043.8    14096.16
6         79.7   139.1    6352.09    19348.8    11086.27
10        85.7   199.4    7344.49    39760.4    17088.58
9         96.4   192.3    9292.96    36979.3    18537.72
Total    592.6  1428.7   41222.14   220360.5    91866.46

lxx = Σx² − (Σx)²/n = 41222.14 − 592.6²/10 = 6104.66
lyy = Σy² − (Σy)²/n = 220360.47 − 1428.70²/10 = 16242.10
lxy = Σxy − (Σx)(Σy)/n = 91866.46 − (592.6)(1428.70)/10 = 7201.70
b = lxy / lxx = 7201.70 / 6104.66 = 1.18
a = ȳ − b x̄ = 1428.7/10 − (1.18)(592.6/10) = 72.96
Regression equation: ŷ = 72.96 + 1.18x
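The least squares estimators above can be verified directly from the raw data:

```python
x = [22.4, 25.1, 32.4, 51.6, 58.1, 65.9, 75.3, 79.7, 85.7, 96.4]
y = [134.0, 80.2, 97.2, 167.0, 132.3, 100.0, 187.2, 139.1, 199.4, 192.3]
n = len(x)

lxx = sum(v * v for v in x) - sum(x) ** 2 / n
lyy = sum(v * v for v in y) - sum(y) ** 2 / n
lxy = sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y) / n

b = lxy / lxx                      # slope
a = sum(y) / n - b * sum(x) / n    # intercept: a = ybar - b * xbar
print(round(b, 2), round(a, 2))  # 1.18 72.96
```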
[Figure: partition of variation about the mean Ȳ – SSR (due to regression) = Σ(Ŷi − Ȳ)²; SST (total); the remainder is random/unexplained]
. regress weight age
[Scatter plot: vertical axis 0–175, horizontal axis Percentage immunized, 0–125]
Procedure:
rs = 1 − 6Σ(di)² / [n(n² − 1)]
B. Linear Regression
• A regression is a description of a response measure, Y, the dependent variable, as a function of an explanatory variable, X, the independent variable.
• The goal is the prediction or estimation of the value of one variable, Y, based on the value of the other variable, X.
• Simple: one predictor variable (X) used to predict the response (Y).
• Multiple: y is the outcome variable and x1, x2, …, xk are the values of k distinct explanatory variables.
. regress weight age order cid mid
F-distribution
• Used for comparing the variances of two populations.
• It is defined in terms of the ratio of the variances of two normally distributed populations, so it is sometimes also called the variance ratio.
• F-statistic, if the assumptions are met:
  F = (s1²/σ1²) / (s2²/σ2²)
  s1² = Σ(x1 − x̄1)² / (n1 − 1)
  s2² = Σ(x2 − x̄2)² / (n2 − 1)
Cont…
• Degrees of freedom: v1=n1 – 1, v2 = n2 – 1
• For different values of v1 and v2 we will get
different distributions, so v1 and v2 are
parameters of F distribution.
• If σ12 = σ22, then, the statistic F = s12/ s22
follows F distribution with n1 – 1 and n2 –
1 degrees of freedom.
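A minimal sketch of the variance-ratio statistic; the sample variances and sample sizes here are hypothetical, assumed only for illustration:

```python
# Hypothetical sample variances and sample sizes (assumed values)
s1_sq, n1 = 12.5, 16
s2_sq, n2 = 8.0, 21

F = s1_sq / s2_sq        # variance ratio under Ho: sigma1^2 == sigma2^2
df1, df2 = n1 - 1, n2 - 1
print(F, df1, df2)  # 1.5625 15 20
```

The observed F would then be compared with the tabulated F value for (df1, df2) degrees of freedom.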
Analysis of Variance
Types of Experimental Designs
Experimental Designs: One-Way ANOVA, Two-Way ANOVA
Completely Randomized Design
1. Experimental Units (Subjects) Are
Assigned Randomly to Treatments
– Subjects are Assumed Homogeneous
2. Variables
– One categorical Independent Variable
– One Continuous Dependent Variable
Assumptions
Hypotheses
H0: μ1 = μ2 = μ3 = ... = μp
– All Population Means are Equal
– No Treatment Effect
Ha: Not All μj Are Equal
– At Least 1 Pop. Mean is Different (μi ≠ μj for some i, j)
– Treatment Effect
[Figure: under H0 the group distributions f(X) coincide (μ1 = μ2 = μ3); under Ha at least one is shifted]
One-Way ANOVA Basic Idea
1. Compares 2 Types of Variation to Test
Equality of Means
2. If Treatment Variation Is Significantly
Greater Than Random Variation then
Means Are Not Equal
3. Variation Measures Are Obtained by ‘Partitioning’ Total Variation
One-Way ANOVA Partitions Total Variation
Total variation = Variation due to treatment + Variation due to random sampling (error)
Total Variation
SS(Total) = (x11 − x̄)² + (x21 − x̄)² + … + (xij − x̄)²
where x̄ is the grand mean of all responses.
[Figure: responses plotted for groups 1–3 with group means x̄1, x̄2, x̄3 and the grand mean x̄]
• 1. Test Statistic
– F = MST / MSE = V.R.
• MST is the Mean Square for Treatment: MST = SST / (p − 1)
• MSE is the Mean Square for Error: MSE = SSE / (n − p)
• 2. Degrees of Freedom
ν1 = p − 1
ν2 = n − p
• p = # of treatment groups, or levels
• n = total sample size
One-Way ANOVA Summary Table
Source of     Degrees of   Sum of      Mean Square          F
Variation     Freedom      Squares     (Variance)
Treatment     p − 1        SST         MST = SST/(p − 1)    MST/MSE
Error         n − p        SSE         MSE = SSE/(n − p)
Total         n − 1        SS(Total)
                           = SST + SSE
One-Way ANOVA F-Test Critical Value
If means are equal, F = MST / MSE ≈ 1. Only reject for large F!
[Figure: F distribution – do not reject H0 for F below Fα(p−1, n−p); reject H0 above it. Always one-tail!]
HOW TO CALCULATE ANOVA’S BY HAND…
Treatment 1   Treatment 2   Treatment 3   Treatment 4
y11           y21           y31           y41          n = 10 obs./group
y12           y22           y32           y42
y13           y23           y33           y43          k = 4 groups
y14           y24           y34           y44
y15           y25           y35           y45
y16           y26           y36           y46
y17           y27           y37           y47
y18           y28           y38           y48
y19           y29           y39           y49
y110          y210          y310          y410

The (within) group variances:
Σj(y1j − ȳ1)²/(10 − 1),  Σj(y2j − ȳ2)²/(10 − 1),  Σj(y3j − ȳ3)²/(10 − 1),  Σj(y4j − ȳ4)²/(10 − 1)
Sum of Squares Within (SSW), or Sum of Squares Error (SSE)
The (within) group variances:
Σj(y1j − ȳ1)²/(10 − 1),  Σj(y2j − ȳ2)²/(10 − 1),  Σj(y3j − ȳ3)²/(10 − 1),  Σj(y4j − ȳ4)²/(10 − 1)
Summing just the numerators:
Σj(y1j − ȳ1)² + Σj(y2j − ȳ2)² + Σj(y3j − ȳ3)² + Σj(y4j − ȳ4)²
= Σi Σj (yij − ȳi)²   (i = 1 … 4, j = 1 … 10)
= Sum of Squares Within (SSW) (or SSE, for chance error)
Sum of Squares Between (SSB), or Sum of Squares Regression (SSR)
Overall mean of all 40 observations (“grand mean”):
ȳ = Σi Σj yij / 40
SSB = 10 × Σi (ȳi − ȳ)²
Sum of Squares Between (SSB): variability of the group means compared to the grand mean (the variability due to the treatment).
Total Sum of Squares (SST)
Squared difference of every observation from the grand mean:
SST = Σi Σj (yij − ȳ)²
Partitioning of Variance
Σi Σj (yij − ȳi)² + 10 × Σi (ȳi − ȳ)² = Σi Σj (yij − ȳ)²
SSW + SSB = SST
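The partitioning identity can be checked numerically; the three groups of four observations below are hypothetical, used only to illustrate that the identity holds:

```python
def anova_partition(groups):
    """Return (SSW, SSB, SST) for a list of groups of observations."""
    all_vals = [v for g in groups for v in g]
    grand = sum(all_vals) / len(all_vals)              # grand mean
    ssw = sum(sum((v - sum(g) / len(g)) ** 2 for v in g) for g in groups)
    ssb = sum(len(g) * (sum(g) / len(g) - grand) ** 2 for g in groups)
    sst = sum((v - grand) ** 2 for v in all_vals)
    return ssw, ssb, sst

# Hypothetical data: 3 groups of 4 observations
ssw, ssb, sst = anova_partition([[60, 67, 42, 67], [50, 52, 43, 67], [48, 49, 50, 55]])
print(abs(ssw + ssb - sst) < 1e-6)  # True: SSW + SSB = SST
```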
ANOVA Table
Source of variation   d.f.   Sum of squares   Mean Sum of Squares   F-statistic   p-value
Example
Data (4 treatment groups, 10 observations per group; partial rows shown):
67  52  49  67
42  43  50  54
67  67  55  67
56  67  56  68
62  59  61  65
SSW computation (partial), group 1 (mean 62):
(60−62)² + (67−62)² + (42−62)² + (67−62)² + (56−62)² + (62−62)² + (64−62)² + (59−62)² + (72−62)² + (71−62)²
plus group 2 (mean 59.7): (50−59.7)² + (52−59.7)² + (43−…
Fill in the ANOVA table
Total   39   2257.1
INTERPRETATION of ANOVA:
How much of the variance in height is explained by treatment group?
R² = “Coefficient of Determination” = SSB/TSS = 196.5/2257.1 ≈ 9%
Coefficient of Determination
R² = SSB / (SSB + SSE) = SSB / SST
The amount of variation in the outcome variable (dependent variable) that is explained by the predictor (independent variable).