
Measures of Association

• While a test of hypothesis can be used to determine whether an association exists between two random variables, it cannot provide a measure of the strength of the association.
• Several methods are available for estimating the magnitude of the effect given the categorical data in a 2 × 2 contingency table.
• For the most part, we have been applying the techniques of hypothesis testing to either continuous or ordinal data.
• What about nominal data?
• Instead of using the normal approximation to the binomial distribution, we could reach the same conclusion using different techniques.
1. Categorical Data
Analysis of categorical data
Brainstorming questions:

1. What is categorical data and what makes it different from other data sets?
2. What are the different methods one can use to analyze such data?
3. Discuss the chi-square test and the requirements associated with it.
Categorical data analysis
Definition: Categorical data consist of variables with a finite number of values – counts rather than measurements.

Categorical variables have two primary types of scales.
 Variables having categories without a natural ordering are called nominal.
 Many categorical variables do have ordered categories. Such variables are called ordinal. Ordinal variables have ordered categories, but distances between categories are unknown.
Examples
Categorical data arise in a number of ways:
– Binary variables: yes/no, pass/fail, live/die
– Unordered multinomial: Christian, Muslim, Protestant, Catholic
– Ordinal:
  • No education, Elementary, Secondary, College
  • Social class: upper, middle, lower
  • Imperfect scale measurement: e.g., Likert's 5-point scale
  • Grouped variables: e.g., income in bands
The analysis of frequency tables

Proportions are a way of expressing counts or frequencies when there are only two possible outcomes, such as the presence or absence of a symptom.

A more general way of showing frequencies is in a table, where each cell of the table corresponds to a particular combination of characteristics relating to two or more classifications.
The analysis of frequency tables

There is a single, general approach to the analysis of all frequency tables, but in practice the method of analysis varies according to:
– the number of categories
– whether the categories are ordered or not
– the number of independent groups of subjects, and
– the nature of the question being asked
1. Chi-Square Test
• A Chi-Square (χ²) is a probability distribution used to make statistical inferences about categorical data (proportions) in which the number of categories is two or more.
• Widely used in the analysis of contingency tables.
• The Chi-Square test allows us to test for association between two categorical variables.
  Ho: No association between the variables.
  HA: There is association.
• Consequently, a significant p-value implies association.
• The Chi-Square test compares observed to expected counts (frequencies) under the assumption of no association (i.e., that Ho is true).
• With this method, data are arranged in the form of a contingency table.
The 2 × 2 frequency table – comparison of two proportions

We are often interested in an exposure–disease relationship, as shown in the following table.

                       Disease
                    Yes      No
Exposure   Yes       a        b      a+b = n1
           No        c        d      c+d = n2
                  a+c = m1  b+d = m2
The general case – the r × c table
Table 1: Source of water and birth weight, sub-sample from Jimma Infant Survival data

                            Birth weight (gm)
Water source          < 2,500   2,500–2,999   3,000–3,499   3,500+   Total
Pipe                     14          57           147         117      335
Other protected          34          98           202         138      472
Unprotected source      107         220           292         167      786
Total                   155         375           641         422    1,593
2 × 2 contingency table
Example: Consider the following sub-sampled data from the Jimma Infant Survival study to look into differences in the proportion of low birth weight babies between urban and rural residents.

              Low birth weight
Residence      Yes      No     Total
Urban           23      540      563
Rural           90      532      622
Total          113    1,072    1,185
SPSS output – Chi-squared test

                                 Value     df   Asymp. Sig.   Exact Sig.   Exact Sig.
                                                (2-sided)     (2-sided)    (1-sided)
Pearson Chi-Square             36.939(b)    1      0.000
Continuity Correction(a)       35.745       1      0.000
Likelihood Ratio               39.581       1      0.000
Fisher's Exact Test                                              0.000        0.000
Linear-by-Linear Association   36.908       1      0.000
N of Valid Cases                1,185

(a) Computed only for a 2x2 table
(b) 0 cells (.0%) have expected count less than 5. The minimum expected count is 53.69.
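The Pearson line of this output can be checked in Python; a minimal sketch, assuming scipy is available (the variable names are ours, not part of the SPSS output):

# Sketch: reproduce the Pearson chi-square for the urban/rural
# low birth weight table with scipy.
from scipy.stats import chi2_contingency

table = [[23, 540],   # Urban: low birth weight yes / no
         [90, 532]]   # Rural: low birth weight yes / no

chi2, p, dof, expected = chi2_contingency(table, correction=False)
print(chi2, dof, p)     # ~36.94 on 1 df, p < 0.001
print(expected.min())   # ~53.69, the minimum expected count in the footnote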
Cont…
The analysis of frequency tables is largely based on hypothesis testing.

The null hypothesis is that the two classification variables (water source and birth weight) are unrelated in the relevant population (the population from which the sample is selected).

We compare the observed frequencies with what we would expect if the null hypothesis were true.
Definition
• The Chi-Square test is a statistic which measures the discrepancy between k observed frequencies O1, O2, …, Ok and the corresponding expected frequencies E1, E2, …, Ek.
• When the Ho of no association is true, the observed and expected counts will be similar, their difference will be close to zero, resulting in a SMALL chi-square statistic value.
• When the HA of an association is true, the observed counts will be unlike the expected counts, their difference will be non-zero and their squared difference will be positive, resulting in a LARGE POSITIVE chi-square statistic value.
• The Chi-Square test is based on the table of χ² for different degrees of freedom (df).
• The simplest case, used in the formulas below, is the 2x2 table.
• If the value of χ² is zero, there is no discrepancy between the observed and the expected frequencies.
• The greater the discrepancy, the larger the value of χ².
• The calculated value of χ² is compared with the tabulated value for the given df.
Degrees of Freedom
• Counts in the Chi-Square test of a 2x2 table are represented as "a", "b", "c" and "d".
• The general calculation: df = (rows − 1) × (columns − 1).
Expected Value
• The expected value of a cell is the product of its row total and its column total, divided by the grand total.
• The expected numbers must be computed for each cell.
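A small sketch of this rule in Python (numpy assumed; the table used is the DRE/biopsy example that appears later in these slides):

# Sketch: expected counts for every cell via the
# (row total x column total) / grand total rule.
import numpy as np

observed = np.array([[50, 20],
                     [10, 20]])        # DRE/biopsy table used later
row_tot = observed.sum(axis=1)         # [70, 30]
col_tot = observed.sum(axis=0)         # [60, 40]
grand = observed.sum()                 # 100

expected = np.outer(row_tot, col_tot) / grand
print(expected)                        # [[42. 28.] [18. 12.]]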
χ² Distribution

• Indexed by the degrees of freedom (n).
• Unlike the z and t distributions, which are always symmetric about 0, the χ² distribution only takes on positive values and is always skewed to the right.
• The skewness diminishes as n increases.

[Figure: χ² curve with acceptance region (0.95) and rejection region (0.05); the critical value shown is 18.307.]
Contingency Table
• A table composed of rows cross-classified by columns.
• A 2x2 contingency table is a table composed of two rows cross-classified by two columns.
• Appropriate to display data that can be classified by two different variables, each of which has only two possible outcomes.
Cont…
The use of the chi-squared distribution for the test statistic X² is based on a 'large sample' approximation.

The guidelines are that 80% of the cells in the table should have expected frequencies greater than 5, and all cells should have expected frequencies greater than 1.

Note that the observed frequencies are not involved here, only the expected frequencies.
Chi-square requirements
 The sample must be randomly drawn from the population.
 Measured variables must be independent.
 Values/categories of the independent and dependent variables must be mutually exclusive.
 At least 80% of the cells should have expected counts/frequencies greater than 5.
 All cells should have expected frequencies greater than 1.
 Data must be reported in raw frequencies (not percentages).
 All the data in the sample must be used.
 Observed frequencies cannot be too small.
Types

• Goodness of Fit of a single variable
• Test of Independence of two variables
• Fisher's exact test
• McNemar's chi-square test
• Chi-square test for trend
• Log-rank chi-square test
Ex. Rolling a die 60 times

                  1     2     3     4     5     6
obs               6     8    12    15    14     5
exp              10    10    10    10    10    10
(obs−exp)        −4    −2     2     5     4    −5
(obs−exp)²       16     4     4    25    16    25
(obs−exp)²/exp   1.6   0.4   0.4   2.5   1.6   2.5
Cont…
• χ² = 1.6 + 0.4 + 0.4 + 2.5 + 1.6 + 2.5 = 9.0,
• with d.f. = (# of categories) – 1 = 6 – 1 = 5.
• From the Chi-square table, χ²(5, 0.10) = 9.2363 > 9.0,
• i.e., do not reject the null hypothesis;
• the p-value is about 0.11.
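The same goodness-of-fit computation in Python; a short sketch, assuming scipy:

# Sketch: chi-square goodness-of-fit for the die-rolling data.
from scipy.stats import chisquare

observed = [6, 8, 12, 15, 14, 5]
expected = [10] * 6                  # a fair die over 60 rolls

stat, p = chisquare(observed, f_exp=expected)
print(stat, p)                       # 9.0, p ~ 0.109 -> do not reject Ho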
Test of Independence
• Test of Independence – two categorical variables are involved, and the observed and expected frequencies are compared. Here the expected frequencies are those the researcher would expect if the two variables were independent of each other.

Cont…
Observed:
          C1     C2
R1         A      B    A+B
R2         C      D    C+D
         A+C    B+D    A+B+C+D

Expected: each cell's expected count = (row total × column total) / grand total.
Example
• Is the digital rectal exam result (DRE) independent of the biopsy result (BIOP)?

• OBSERVED
           DRE+   DRE−
BIOP+        50     20     70
BIOP−        10     20     30
             60     40    100
Solution
 O     E
50     E1 = (70 × 60)/100 = 42
10     E2 = (30 × 60)/100 = 18
20     E3 = (70 × 40)/100 = 28
20     E4 = (30 × 40)/100 = 12
Cont…
 O     E    (O−E)   (O−E)²   (O−E)²/E
50    42      8       64       1.52
10    18     −8       64       3.56
20    28     −8       64       2.29
20    12      8       64       5.33
                      X² =    12.7

X² with (r−1) × (c−1) d.f. = (2−1) × (2−1) = 1 d.f.
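The same test of independence in Python; a sketch assuming scipy:

# Sketch: chi-square test of independence for the DRE/biopsy table.
from scipy.stats import chi2_contingency

table = [[50, 20],    # BIOP+: DRE+ / DRE-
         [10, 20]]    # BIOP-: DRE+ / DRE-

chi2, p, dof, expected = chi2_contingency(table, correction=False)
print(chi2, dof)      # ~12.70 with 1 d.f.
print(p)              # ~0.0004 -> reject Ho of independence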
Example
• A study was conducted to look at the effects of oral contraceptives (OC) on heart disease in women 40 to 44 years of age. It is found that among 5,000 current OC users at baseline, 13 women develop a myocardial infarction (MI) over a 3-year period, whereas among 10,000 non-OC users, 7 develop an MI over a 3-year period.
  – P1 = 0.0026, P2 = 0.0007
  – Z-test = 2.77, P-value = 0.006
  – There is a highly significant association between MI and OC use.

Display the above data in the form of a 2x2 contingency table:

                 MI status over 3 years
OC-use group       Yes        No      Total
OC users            13      4,987     5,000
Non-OC users         7      9,993    10,000
Total               20     14,980    15,000

Is the proportion of MI the same in OC users and non-OC users?
What can be said about the relationship between MI status and OC use?
Example
• Compute the expected table for the OC-MI data in the previous example.

Observed:
                 MI status over 3 years
OC-use group       Yes        No      Total
OC users            13      4,987     5,000
Non-OC users         7      9,993    10,000
Total               20     14,980    15,000

Expected:
                 MI status over 3 years
OC use group       Yes        No      Total
OC users           6.7    4,993.3     5,000
Non-OC users      13.3    9,986.7    10,000
Total               20     14,980    15,000

• X² ≈ 8, 0.001 < p-value < 0.005
Example

X² = 8.30, P-value = 0.004

Example: Observed Numbers – Response by Treatment; Expected Numbers
[The observed and expected tables shown on these slides are not reproduced here.]

Shortcut Formula for 2x2 Tables (the standard form):
X² = n(ad − bc)² / [(a+b)(c+d)(a+c)(b+d)]
Example
• A study was conducted to investigate the possible cause of a gastroenteritis outbreak following a lunch served in a high school cafeteria. Among the 225 students who ate the sandwiches, 109 became ill, while among the 38 students who did not eat the sandwiches, 4 became ill.
• Present the data in a 2x2 contingency table.
• With this method, data are arranged in the form of a contingency table.
• This is a 2 × 2 table for two dichotomous random variables.
• We again wish to know whether the proportions of students who became ill in each of the groups are identical.
• To carry out the test, we first calculate the expected counts for the table assuming that:
  H0: p1 = p2
  HA: p1 ≠ p2
  p1 = 48.44%, p2 = 10.52%
  Z test = 4.36
• Expected counts are computed for each cell as E = (row total × column total) / grand total.
• The chi-square test compares the observed frequencies in each category with the expected frequencies given that H0 is true.
• Are the deviations between Observed and Expected too large to be attributed to chance?
• To determine this, deviations from all 4 cells must be combined.
• Calculate the sum: X² = Σ (O − E)²/E.
• The Ho is rejected at the α level if X² is too large, in particular, if X² > χ²(1, α).
• If α = 0.05, we would reject H0 for X² greater than χ²(1, 0.05) = 3.84.
• Therefore, we reject the Ho.
• The p-value is given by the area under the χ² distribution to the right of X².
• P-value < 0.001
Relationship between X² and the Z test
X² = Z²
19 ≈ (4.36)² = 19.01
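A quick check of this identity in Python on the sandwich data; a sketch assuming scipy, with the 2x2 table assembled from the counts above:

# Sketch: the uncorrected chi-square equals the square of the
# two-sample Z statistic for the gastroenteritis data.
from scipy.stats import chi2_contingency

table = [[109, 116],   # ate sandwiches: ill / not ill (225 total)
         [4, 34]]      # did not eat:    ill / not ill (38 total)

chi2, p, dof, _ = chi2_contingency(table, correction=False)
print(chi2)            # ~19.1, i.e. approximately (4.36)^2
print(p)               # < 0.001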
Assumptions of the χ² test
 No expected value in the table is < 5, and no more than 20% of the expected frequencies should be < 5.
 If this does not hold:
  – row or column variable categories can sometimes be combined to make the expected frequencies larger, or
  – use the Yates correction.
 For a 2x2 table, when the total number of observations is less than 20, or when it is greater than 20 and the smallest of the four expected frequencies is < 5, use Fisher's Exact test.
Fisher's Exact Test

• Given the fixed margins, the probability of obtaining the specific table which was observed is (the standard hypergeometric probability):
  P = [(a+b)! (c+d)! (a+c)! (b+d)!] / [n! a! b! c! d!]
• Both the Chi-square test and the exact test can be generalized to allow the comparison of three or more proportions.
• The data are arranged in the form of an R × C contingency table.
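In Python, Fisher's exact test might be run as follows; a sketch assuming scipy, with a made-up table whose small expected counts are exactly the situation that calls for the exact test:

# Sketch: Fisher's exact test on a small hypothetical 2x2 table.
from scipy.stats import fisher_exact

table = [[8, 2],   # hypothetical counts, small enough that the
         [1, 5]]   # chi-square approximation would be unreliable

odds_ratio, p = fisher_exact(table, alternative='two-sided')
print(odds_ratio, p)   # exact two-sided p-value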
MCNEMAR’S TEST FOR
CORRELATED (DEPENDENT)
PROPORTIONS

MCNEMAR'S TEST FOR CORRELATED (DEPENDENT) PROPORTIONS

Basis / Rationale for the Test

 The approximate test previously presented for assessing a difference in proportions is based upon the assumption that the two samples are independent.
 Suppose, however, that we are faced with a situation where this is not true. Suppose we randomly select 100 people, and find that 20% of them have flu. Then, imagine that we apply some type of treatment to all sampled people; and on a post-test, we find that 20% have flu.
MCNEMAR'S TEST FOR CORRELATED (DEPENDENT) PROPORTIONS

 We might be tempted to suppose that no hypothesis test is required under these conditions, in that the 'Before' and 'After' p values are identical, and would surely result in a test statistic value of 0.00.
 The problem with this thinking, however, is that the two sample p values are dependent, in that each person was assessed twice. It is possible that the 20 people that had flu originally still had flu. It is also possible that the 20 people that had flu on the second test were a completely different set of 20 people!
MCNEMAR'S TEST FOR CORRELATED (DEPENDENT) PROPORTIONS

 It is for precisely this type of situation that McNemar's Test for Correlated (Dependent) Proportions is applicable.
 McNemar's Test employs two unique features for testing the two proportions:
  * a special fourfold contingency table; with a
  * special-purpose chi-square (χ²) test statistic (the approximate test).
MCNEMAR'S TEST FOR CORRELATED (DEPENDENT) PROPORTIONS

Nomenclature for the Fourfold (2 x 2) Contingency Table

    A        B      (A + B)
    C        D      (C + D)
(A + C)  (B + D)       n
MCNEMAR'S TEST FOR CORRELATED (DEPENDENT) PROPORTIONS

Underlying Assumptions of the Test

1. Construct a 2x2 table where the paired observations are the sampling units.
2. Each observation must represent a single joint event possibility; that is, classifiable in only one cell of the contingency table.
3. In its Exact form, this test may be conducted as a One Sample Binomial for the B & C cells.
MCNEMAR'S TEST FOR CORRELATED (DEPENDENT) PROPORTIONS

Underlying Assumptions of the Test

4. The expected frequency (fe) for the B and C cells of the contingency table must be equal to or greater than 5, where
   fe = (B + C) / 2
   from the Fourfold table.
MCNEMAR'S TEST FOR CORRELATED (DEPENDENT) PROPORTIONS

Sample Problem
A randomly selected group of 120 students taking a standardized test for entrance into college exhibits a failure rate of 50%. A company which specializes in coaching students on this type of test has indicated that it can significantly reduce failure rates through a four-hour seminar. The students are exposed to this coaching session, and re-take the test a few weeks later. The school board is wondering if the results justify paying this firm to coach all of the students in the high school. Should they? Test at the 5% level.
MCNEMAR'S TEST FOR CORRELATED (DEPENDENT) PROPORTIONS

Sample Problem
The summary data for this study appear as follows:

Number of    Status Before    Status After
Students     Session          Session
    4        Fail             Fail
    4        Pass             Fail
   56        Fail             Pass
   56        Pass             Pass
MCNEMAR'S TEST FOR CORRELATED (DEPENDENT) PROPORTIONS

The data are then entered into the Fourfold Contingency table:

                  Before
                Pass   Fail
After   Pass      56     56    112
        Fail       4      4      8
                  60     60    120
MCNEMAR'S TEST FOR CORRELATED (DEPENDENT) PROPORTIONS

 Step I: State the Null & Research Hypotheses
   H0: πB = πC
   H1: πB ≠ πC
 where πB and πC relate to the proportions of observations reflecting changes in status (the B & C cells in the table).

 Step II: α = 0.05
MCNEMAR'S TEST FOR CORRELATED (DEPENDENT) PROPORTIONS

 Step III: State the Associated Test Statistic
   The usual (approximate) McNemar statistic is χ² = (B − C)² / (B + C), computed from the two discordant cells.

 Step IV: State the Distribution of the Test Statistic When Ho is True
   χ² is distributed as χ² with 1 df when Ho is true.
MCNEMAR'S TEST FOR CORRELATED (DEPENDENT) PROPORTIONS

Step V: Reject Ho if χ² > 3.84

[Figure: χ² distribution with df = 1; rejection region beyond the critical value 3.84.]
MCNEMAR'S TEST FOR CORRELATED (DEPENDENT) PROPORTIONS

 Step VI: Calculate the Value of the Test Statistic
   χ² = (B − C)² / (B + C) = (56 − 4)² / (56 + 4) = 2704/60 ≈ 45.1, which far exceeds 3.84, so Ho is rejected.
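A sketch of the same computation in Python (the manual formula plus scipy for the p-value; whether to apply the continuity correction is a convention choice, so both versions are shown):

# Sketch: McNemar's test for the coaching data, using the discordant cells.
from scipy.stats import chi2

b, c = 56, 4                      # Fail->Pass and Pass->Fail counts
stat = (b - c) ** 2 / (b + c)     # ~45.1 (uncorrected form)
p = chi2.sf(stat, df=1)           # upper-tail chi-square probability, 1 df
print(stat, p)                    # reject Ho at the 5% level

# Continuity-corrected variant, often reported by software:
stat_cc = (abs(b - c) - 1) ** 2 / (b + c)   # ~43.4
print(stat_cc, chi2.sf(stat_cc, df=1))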
Trend test in 2 x c tables

• We have used the Chi-squared test to evaluate whether two categorical variables are associated in the population.
• When one variable is binary (nominal) and the other is ordered categorical (ordinal), we may be interested in whether their association follows a trend.
Chi squared test for trend
Consider the following data from the Jimma Infant Survival project that relates birth weight with educational status of mothers.

                              Birth weight (gm)
Highest grade           < 2500           2500+            Total
completed by mother    n      %        n      %        n       %
Illiterate           114   11.8      849   88.2      963   100.0
Elementary            33    9.0      335   91.0      368   100.0
Junior                 6    5.8       98   94.2      104   100.0
Senior HS & above      2    1.3      154   98.7      156   100.0
Total                155    9.7    1,436   90.3    1,591   100.0
Cont..
Number of categories

Variable 1        Variable 2        Method of analysis
2                 2                 Chi-square test with continuity correction (Fisher's exact test)
2                 3+ not ordered    General Chi-square test of independence
2                 3+ ordered        Chi-squared test for trend
3+ not ordered    3+ not ordered    General Chi-square test of independence
3+ ordered        3+ not ordered    Kruskal-Wallis test
3+ ordered        3+ ordered        Rank correlation
Ordered Categories
When we wish to compare frequencies or proportions among groups which have an ordering, we should make use of the ordering to increase the power of the statistical analysis.

When the groups are ordered, we usually expect any differences among the groups to be related to the ordering.

Failure to take account of the ordering of groups is a common statistical error.
SPSS output – Chi squared test

Chi-Square Tests                 Value    df   Asymp. Sig. (2-sided)
Pearson Chi-Square                19.6     3   0.0002028
Likelihood Ratio                  26.4     3   0.0000078
Linear-by-Linear Association      19.4     1   0.0000104
N of Valid Cases                 1,591

(a) 0 cells (.0%) have expected count less than 5. The minimum expected count is 10.13.

We look at the linear-by-linear association because educational status is ordered data, and the test is called the chi-square test for trend.
Cont…
The total variation among the groups can be subdivided into that due to:
– a trend in proportions across the groups, and
– the remainder.

The value of X² for trend will always be less than X² for the overall comparison.

If most of the variation is due to a trend across the groups, then the test for trend will yield a much smaller p-value.
Low birth weight rate by educational status of mothers

[Figure: bar chart of low birth weight rate (%) against mothers' educational status – the rate falls steadily from Illiterate to Senior HS & above.]
For the above data on birth weight and education:
The standard X² = 19.6 with 3 degrees of freedom, p < 0.001; X² for trend = 19.4 on 1 degree of freedom (remember the df for a linear regression line between two variables), p < 0.001.

There is thus strong evidence of a linear trend in the proportion of low birth weight in relation to mothers' educational status.

Note: This relationship cannot be interpreted as a causal relationship.
Trend test in 2 x c (column) tables

                            Salt intake
                    Low          Regular          High        Total
                  O      E      O      E        O      E
Hypertension      18   38.5     54    54.1      78    57.4      150
Without          100   79.5    112   111.9      98   118.6      310
hypertension
Total            118           166              176             460

Hypertension   Salt intake   Observed   Expected    O−E      (O−E)²    (O−E)²/E
Yes            Low               18       38.5     −20.5     420.25    10.9
Yes            Regular           54       54.1      −0.1       0.01     0.0002
Yes            High              78       57.4      20.6     424.36     7.4
No             Low              100       79.5      20.5     420.25     5.3
No             Regular          112      111.9       0.1       0.01     0.00009
No             High              98      118.6     −20.6     424.36     3.6
CONT…
• The chi-squared test without correction is 27.20 with 2 degrees of freedom, p < 0.001.
• How can we interpret this result?
• Our interest is in whether the proportion of people with hypertension increases or diminishes across the groups.
• To answer this, we need a Chi-squared test for trend.
• We conduct a chi-square test for trend when we assess whether a binary variable varies linearly through the levels of another variable, i.e., to assess whether there is a dose-response effect.
• The hypotheses for this test are:
  HO: the mean scores in the two groups (of the binary variable) are the same – no trend association.
  HA: the mean scores in the two groups (of the binary variable) are different – there is a trend association.
• Thus, the Chi-square test becomes a test comparing two means, but with only one degree of freedom.
CONT…
 Notation used (sums over the k columns):
   N = Σ ni,  R = Σ ri,  P = R/N,  x̄ = (Σ ni·xi)/N

 The test statistic X²trend becomes:

   X²trend = [Σ ri·xi − R·x̄]² / { P(1 − P) · [Σ ni·xi² − N·x̄²] }
Trend test in 2 x c tables
• To calculate this test, assign a numerical score to each group.

                            Salt intake
                        Low   Regular   High     Total
Hypertension (ri)        18      54       78     150 (R)
Without hypertension    100     112       98     310
Total (ni)              118     166      176     460 (N)
Score (xi)                1       2        3
ri·xi                    18     108      234     360
ni·xi                   118     332      528     978
ni·xi²                  118     664    1,584   2,366
Solution

x̄ = 978/460 = 2.13
• Σ ri·xi = 360
• R = 150
• P = 0.326

X²trend = [360 − 150 × 2.13]² / { 0.326 × (1 − 0.326) × [2,366 − 460 × (2.13)²] } ≈ 26.8

• With this result, we reject the null hypothesis (α = 0.05 and d.f. 1; the chi-square critical value is 3.84).
• So, there is strong evidence of a trend.
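The arithmetic above is easy to script; a sketch implementing exactly this formula (the variable names are ours):

# Sketch: chi-square test for trend on the salt-intake data.
from scipy.stats import chi2

r = [18, 54, 78]        # hypertension counts per salt-intake group
n = [118, 166, 176]     # group totals
x = [1, 2, 3]           # scores assigned to the ordered groups

N, R = sum(n), sum(r)
P = R / N
xbar = sum(ni * xi for ni, xi in zip(n, x)) / N

num = (sum(ri * xi for ri, xi in zip(r, x)) - R * xbar) ** 2
den = P * (1 - P) * (sum(ni * xi**2 for ni, xi in zip(n, x)) - N * xbar**2)
stat = num / den
print(stat, chi2.sf(stat, df=1))   # ~26.8 on 1 d.f., p < 0.001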
2. Relative Risk (RR)
• Risk Ratio.
• Defined as the ratio of the incidence of disease in the exposed group divided by the corresponding incidence of disease in the non-exposed group.
• A point estimate of the risk ratio (RR = p1/p2) is given by:

              Disease
Exposure    Yes    No    Total
Yes          a      b     a+b
No           c      d     c+d
Total       a+c    b+d     N

RR = [a/(a+b)] / [c/(c+d)]
                    Breast Cancer
1st Gave Birth     Yes      No      Total
≥ 25 years          31    1,597     1,628
< 25 years          65    4,475     4,540
Total               96    6,072     6,168

RR = [a/(a+b)] / [c/(c+d)]
a/(a+b) = 31/1628 = 0.019
c/(c+d) = 65/4540 = 0.014
RR = 0.019/0.014 ≈ 1.36

• Women who first give birth at an older age are 36% more likely to develop breast cancer.
• To obtain a CI for the RR, first find a CI for ln(RR):
  ln(RR) ± z(1−α/2) · √( b/(a·n1) + d/(c·n2) )
• where n1 = a+b, n2 = c+d, and ln = natural logarithm.
• Exponentiate each side to get a CI for RR.
• For the breast cancer data, a 95% CI for ln(RR) is
  ln(1.36) ± 1.96 · √( 1597/(31 × 1628) + 4475/(65 × 4540) ) = (−0.12, 0.73)
• Consequently, a 95% CI for RR itself is
  (e^−0.12, e^0.73), or (0.89, 2.08)
• This interval contains the value 1.
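A sketch of this computation in Python, using the standard log-based CI (the variable names are ours):

# Sketch: relative risk and 95% CI for the breast cancer data.
from math import log, sqrt, exp

a, b = 31, 1597     # 1st birth >= 25: cases / non-cases
c, d = 65, 4475     # 1st birth <  25: cases / non-cases
n1, n2 = a + b, c + d

rr = (a / n1) / (c / n2)                   # ~1.36
se = sqrt(b / (a * n1) + d / (c * n2))     # SE of ln(RR)
lo, hi = exp(log(rr) - 1.96 * se), exp(log(rr) + 1.96 * se)
print(rr, (lo, hi))                        # ~1.36, (~0.89, ~2.08)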
3. The Odds Ratio
• The odds ratio (OR) is the odds in favor of disease for the exposed group divided by the odds in favor of disease for the unexposed group.
• The odds in favor of disease = p/(1−p), where p = probability of disease.
• Odds = Pr(event occurs) / Pr(event does not occur) = p/(1−p)
• The odds ratio is defined as:
  OR = [p1/(1−p1)] / [p2/(1−p2)]
• and is estimated from the 2x2 table by:
  ÔR = ad/bc
Example:
• In a study of the risk factors for invasive cervical cancer, the following data were collected (Case-Control).
• The odds ratio is estimated as ÔR = 1.52.
• Women with cancer have an odds of smoking that is 1.52 times the odds of those without cancer.
• A CI can be constructed for the OR.
• To find a CI for the underlying OR, we first find a CI for ln(OR) = (c1, c2), where
  c1, c2 = ln(ÔR) ± z · √(1/a + 1/b + 1/c + 1/d)
• Exponentiate the upper and lower confidence limits for the natural log of the OR:
  ( e^c1 , e^c2 )
• For the cervical cancer data, √(1/a + 1/b + 1/c + 1/d) = 0.166.
• Therefore, a 95% CI for ln(OR) is
  ln(1.52) ± 1.96 × (0.166), or (0.093, 0.744)
• A 95% CI for the OR itself is
  (e^0.093, e^0.744), or (1.10, 2.13)
• This interval does not contain the value 1.
• We conclude that the odds of developing cervical cancer are significantly higher for smokers than for nonsmokers.
Example: Odds of Death Related to Vit A use (Case-Control Study)
• What is the estimated OR?
• Estimated OR = (46/61)/(74/59) = 0.60
• 95% CI = (0.36, 1.04)
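A sketch for the odds ratio and its log-based CI; the cell counts mirror the quoted expression (46/61)/(74/59) – the exact row/column labels are not shown on the slide, so the comments below are our interpretation, and the computed interval can differ in the last digit from the slide's rounded values:

# Sketch: odds ratio and 95% CI from a 2x2 case-control table.
from math import log, sqrt, exp

a, b = 46, 61     # counts forming the numerator odds 46/61 (labels assumed)
c, d = 74, 59     # counts forming the denominator odds 74/59 (labels assumed)

or_hat = (a / b) / (c / d)                 # ~0.60
se = sqrt(1/a + 1/b + 1/c + 1/d)           # SE of ln(OR)
lo, hi = exp(log(or_hat) - 1.96*se), exp(log(or_hat) + 1.96*se)
print(or_hat, (lo, hi))                    # ~0.60, roughly (0.36, 1.0)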
2. Quantitative Data
• Previously we focused on measures of the strength of association between two dichotomous random variables.
• We can also look at the relationship between two continuous variables.

Correlation Analysis
• Measures the strength and direction of the linear relationship between two continuous random variables X and Y.

Linear regression analysis
• Concerned with predicting or estimating the value of one variable based on (given) the value of the other variable: the regression of Y on X.
A. Correlation
Correlation

 Finding the relationship between two quantitative variables without being able to infer causal relationships.

 Correlation is a statistical technique used to determine the degree to which two variables are related.


Scatter diagram
• Rectangular coordinates
• Two quantitative variables
• One variable is called independent (X) and the second is called dependent (Y)
• Points are not joined
• No frequency table


Example

14 March 2024 114


SBP(mmHg)

220
200
180
160
140
120
100
80 wt (kg)
60 70 80 90 100 110 120

Scatter diagram of weight and systolic blood pressure


14 March 2024 115
SBP (mmHg)
220

200

180

160

140

120

100

80
Wt (kg)
60 70 80 90 100 110 120

Scatter diagram of weight and systolic blood pressure


14 March 2024 116
Scatter plots

The pattern of data is indicative of the type of relationship between your two variables:
 positive relationship
 negative relationship
 no relationship


Positive relationship
[Figure: height in cm rising with age in weeks.]

Negative relationship
[Figure: reliability falling as age of car increases.]

No relation
[Figure: random scatter with no pattern.]


Correlation Coefficient

 A statistic showing the degree of relation between two variables.


Sample Correlation coefficient (r)

 It is also called Pearson's correlation or product moment correlation coefficient.
 It measures the nature and strength of the relationship between two variables of the quantitative type.

 The sign of r denotes the nature of the association, while the value of r denotes the strength of the association.


 If the sign is +ve, the relation is direct (an increase in one variable is associated with an increase in the other variable, and a decrease in one variable is associated with a decrease in the other variable).

 If the sign is −ve, the relationship is inverse or indirect (an increase in one variable is associated with a decrease in the other).


How to compute the simple correlation coefficient (r)

r = Σ(Xi − X̄)(Yi − Ȳ) / √[ Σ(Xi − X̄)² · Σ(Yi − Ȳ)² ]

or, equivalently,

r = [ ΣXY − (ΣX)(ΣY)/n ] / √{ [ΣX² − (ΣX)²/n] · [ΣY² − (ΣY)²/n] }


Example:
A sample of 6 children was selected; data about their age in years and weight in kilograms were recorded as shown in the following table. It is required to find the correlation between age and weight.

Serial No   Age (years)   Weight (kg)
1                7             12
2                6              8
3                8             12
4                5             10
5                6             11
6                9             13
These two variables are of the quantitative type; one variable (age) is called the independent variable and denoted (X), and the other (weight) is called the dependent variable and denoted (Y). To find the relation between age and weight, compute the simple correlation coefficient using the following formula:

r = [ Σxy − (Σx)(Σy)/n ] / √{ [Σx² − (Σx)²/n] · [Σy² − (Σy)²/n] }


Serial n   Age (x)   Weight (y)    xy     x²     y²
1             7          12        84     49    144
2             6           8        48     36     64
3             8          12        96     64    144
4             5          10        50     25    100
5             6          11        66     36    121
6             9          13       117     81    169
Total      Σx = 41    Σy = 66   Σxy = 461  Σx² = 291  Σy² = 742
41  66
461 
r 6
 (41)2   (66)2 
291  .742  
 6  6 

r = 0.759
strong direct correlation

14 March 2024 130
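The same coefficient via scipy; a minimal sketch:

# Sketch: Pearson correlation for the age/weight data.
from scipy.stats import pearsonr

age    = [7, 6, 8, 5, 6, 9]
weight = [12, 8, 12, 10, 11, 13]

r, p = pearsonr(age, weight)
print(r)          # ~0.76, a strong direct correlation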


EXAMPLE: Relationship between Anxiety and Test Scores

Anxiety (X)   Test score (Y)     X²      Y²     XY
10                  2           100       4     20
 8                  3            64       9     24
 2                  9             4      81     18
 1                  7             1      49      7
 5                  6            25      36     30
 6                  5            36      25     30
ΣX = 32         ΣY = 32      ΣX² = 230  ΣY² = 204  ΣXY = 129
Calculating the Correlation Coefficient

r = [ (6)(129) − (32)(32) ] / √{ [6(230) − 32²] · [6(204) − 32²] }
  = (774 − 1024) / √[(356)(200)]
  = −0.94

Indirect strong correlation


Spearman Rank Correlation Coefficient (rs)

It is a non-parametric measure of correlation. This procedure makes use of the two sets of ranks that may be assigned to the sample values of X and Y.

The Spearman rank correlation coefficient can be computed in the following cases:
 Both variables are quantitative.
 Both variables are qualitative ordinal.
 One variable is quantitative and the other is qualitative ordinal.
Procedure:

1. Rank the values of X from 1 to n, where n is the number of pairs of values of X and Y in the sample.
2. Rank the values of Y from 1 to n.
3. Compute the value of di for each pair of observations by subtracting the rank of Yi from the rank of Xi.
4. Square each di and compute Σdi², the sum of the squared values.
5. Apply the following formula:

   rs = 1 − 6 Σ(di)² / [ n(n² − 1) ]

The value of rs denotes the magnitude and nature of the association, with the same interpretation as simple r.


Example
In a study of the relationship between level of injury and income, the following data were obtained. Find the relationship between them and comment.

Sample    Level of injury (X)   Income (Y)
A         Moderate                 25
B         Mild                     10
C         Fatal                     8
D         Severe                   10
E         Severe                   15
F         Normal                   50
G         Fatal                    60
Answer:

         X           Y    Rank X   Rank Y     di      di²
A    Moderate       25      5        3         2       4
B    Mild           10      6        5.5       0.5     0.25
C    Fatal           8      1.5      7        −5.5    30.25
D    Severe         10      3.5      5.5      −2       4
E    Severe         15      3.5      4        −0.5     0.25
F    Normal         50      7        2         5      25
G    Fatal          60      1.5      1         0.5     0.25

                                          Σ di² = 64
6  64
rs  1   0.1
7(48)

Comment:
There is an indirect weak correlation
between level of injury and income.

14 March 2024 138
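A sketch of the same rank computation in Python; scipy's rankdata handles the tied ranks, and the injury codes below are our own encoding (1 = most severe, matching the slide's ranking). Note the simplified formula is only approximate when ties are present, so scipy.stats.spearmanr can give a slightly different value:

# Sketch: Spearman's rs for the injury/income data via 1 - 6*sum(d^2)/...
from scipy.stats import rankdata

injury = [3, 4, 1, 2, 2, 5, 1]             # moderate, mild, fatal, severe,
income = [25, 10, 8, 10, 15, 50, 60]       # severe, normal, fatal (coded)

rank_x = rankdata(injury)                  # average ranks for ties
rank_y = rankdata([-v for v in income])    # the slide ranks income descending
d2 = sum((rx - ry) ** 2 for rx, ry in zip(rank_x, rank_y))
n = len(injury)
rs = 1 - 6 * d2 / (n * (n**2 - 1))
print(d2, rs)                              # 64.0, ~ -0.14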


exercise



What is regression analysis?
• An extension of correlation.
• A way of measuring the relationship between two or more variables.
• Used to calculate the extent to which one variable (DV) changes when the other variable(s) (IV(s)) change.
• Used to help understand possible causal effects of one variable on another.


What is linear regression (LR)?
• Involves:
  – one predictor (IV) and
  – one outcome (DV)
• Explains a relationship using a straight line fit to the data.

Least squares criterion


Least-Squares Regression
 The most common method for fitting a regression line is the method of least squares.
 This method calculates the best-fitting line for the observed data by minimizing the sum of the squares of the vertical deviations from each data point to the line (if a point lies on the fitted line exactly, then its vertical deviation is 0).
 Because the deviations are first squared, then summed, there are no cancellations between positive and negative values.
Linear Regression – Model
[Figure: scatter of points about the line Y = α + βX, showing the actual value of Yi and its error εi.]

Linear Regression – Model

Population regression model:    Yi = α + βXi + εi
Sample:                         Yi = b0 + b1Xi + ei
Fitted (estimated) line:        Ŷ = b0 + b1Xi


Simple Linear Regression Model
• The population simple linear regression model:

  y = α + βx + ε   or   μy|x = α + βx

  where (α + βx) is the nonrandom or systematic component and ε is the random component.

• Where:
• y is the dependent (response) variable, the variable we wish to explain or predict;
• x is the independent (explanatory) variable, also called the predictor variable;
• ε is the error term, the only random component in the model, and thus the only source of randomness in y.

Cont…
• μy|x is the mean of y when x is specified, also called the conditional mean of Y.
• α is the intercept of the systematic component of the regression relationship.
• β is the slope, the gradient or the regression coefficient of the systematic component.
Picturing the Simple Linear Regression Model
• The simple linear regression model posits an exact linear relationship between the expected or average value of Y, the dependent variable, and X, the independent or predictor variable:
  μy|x = α + βx
  where β is the slope and α is the intercept.
• Actual observed values of Y (y) differ from the expected value (μy|x) by an unexplained or random error (ε):
  y = μy|x + ε = α + βx + ε
Assumptions of the Simple Linear Regression Model
(the LINE assumptions)
• The relationship between X and Y is a straight-line (linear) relationship.
• The values of the independent variable X are assumed fixed (not random); the only randomness in the values of Y comes from the error term ε.
• The errors are uncorrelated (i.e. independent) in successive observations.
• The errors are normally distributed with mean 0 and variance σ² (equal variance); that is, ε ~ N(0, σ²).
• Equivalently: identical normal distributions of errors, all centered on the regression line, Y|x ~ N(μy|x, σ²).
Fitting a Regression Line
[Figure: four panels – the data; three errors from the least squares regression line; three errors from a fitted line; errors from the least squares regression line are minimized.]
Errors in Regression

The fitted regression line is ŷ = a + bx, and ŷi is the predicted value of Y for xi.

The error for the i-th observation is ei = yi − ŷi.
Sums of Squares, Cross Products, and Least Squares Estimators

Sums of squares and cross products:
  lxx = Σ(x − x̄)² = Σx² − (Σx)²/n
  lyy = Σ(y − ȳ)² = Σy² − (Σy)²/n
  lxy = Σ(x − x̄)(y − ȳ) = Σxy − (Σx)(Σy)/n

Least-squares regression estimators:
  b = lxy / lxx
  a = ȳ − b·x̄
  ŷ = a + bx
Example

Patient      x        y         x²          y²          x·y
1          22.4    134.0      501.76    17,956.0     3,001.60
2          51.6    167.0    2,662.56    27,889.0     8,617.20
3          58.1    132.3    3,375.61    17,503.3     7,686.63
4          25.1     80.2      630.01     6,432.0     2,013.02
5          65.9    100.0    4,342.81    10,000.0     6,590.00
6          79.7    139.1    6,352.09    19,348.8    11,086.27
7          75.3    187.2    5,670.09    35,043.8    14,096.16
8          32.4     97.2    1,049.76     9,447.8     3,149.28
9          96.4    192.3    9,292.96    36,979.3    18,537.72
10         85.7    199.4    7,344.49    39,760.4    17,088.58
Total     592.6  1,428.7   41,222.14   220,360.5    91,866.46

lxx = Σx² − (Σx)²/n = 41,222.14 − (592.6)²/10 = 6,104.66
lyy = Σy² − (Σy)²/n = 220,360.47 − (1,428.70)²/10 = 16,242.10
lxy = Σxy − (Σx)(Σy)/n = 91,866.46 − (592.6)(1,428.70)/10 = 7,201.70

b = lxy/lxx = 7,201.70/6,104.66 = 1.18
a = ȳ − b·x̄ = 1,428.7/10 − (1.18)(592.6/10) = 72.96

Regression equation: ŷ = 72.96 + 1.18x
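A sketch checking these estimates with scipy:

# Sketch: least-squares fit for the 10-patient example.
from scipy.stats import linregress

x = [22.4, 51.6, 58.1, 25.1, 65.9, 79.7, 75.3, 32.4, 96.4, 85.7]
y = [134.0, 167.0, 132.3, 80.2, 100.0, 139.1, 187.2, 97.2, 192.3, 199.4]

fit = linregress(x, y)
print(fit.slope, fit.intercept)   # ~1.18 and ~72.96, as computed by hand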


Linear Regression – Variation

SST = SSR + SSE
  SSR: due to regression (explained)
  SSE: random/unexplained

SST = Σ(Yi − Ȳ)²   (total variation)
SSR = Σ(Ŷi − Ȳ)²   (explained by the regression)
SSE = Σ(Yi − Ŷi)²  (unexplained error)
. regress weight age

Source SS df MS Number of obs = 4,390


F(1, 4388) = 167.22
Model 52428415.3 1 52428415.3 Prob > F = 0.0000
Residual 1.3758e+09 4,388 313532.283 R-squared = 0.0367
Adj R-squared = 0.0365
Total 1.4282e+09 4,389 325406.259 Root MSE = 559.94

weight Coef. Std. Err. t P>|t| [95% Conf. Interval]

age 23.60187 1.825173 12.93 0.000 20.02361 27.18013


_cons 2645.316 40.40921 65.46 0.000 2566.093 2724.538
• Correlation measures the strength and direction of the relationship between two continuous variables from a single population.
• The relationship should be linear.
• It measures association, not a cause-and-effect relationship.
• Example: relationship between weight and height.
• Procedure:
  – Display the data in a two-way scatter plot of Y versus X before carrying out any further analysis.
  – One variable is plotted on the X-axis (the independent variable); the other on the Y-axis (the dependent or outcome variable).
Nation            % Child Immunized   MR per 1000 LB
Bolivia                  77                118
Brazil                   69                 65
Cambodia                 32                184
Canada                   85                  8
China                    94                 43
Czech Republic           99                 12
Egypt                    89                 55
Ethiopia                 13                208
Finland                  95                  7
France                   95                  9
Greece                   54                  9
India                    89                124
Italy                    95                 10
Japan                    87                  6
Mexico                   91                 33
Poland                   98                 16
Russia                   73                 32
Senegal                  47                145
Turkey                   76                 87
UK                       90                  9

Percentage of children immunized against DPT and under-five mortality rate for 20 countries, 1992
[Figure: scatter plot of under-5 mortality rate (y-axis, 0–250) against percentage immunized (x-axis, 0–125).]

It appears that the under-5 mortality rate decreases as the percentage of children immunized against DPT increases.
Correlation Coefficients (ρ, r)
ρ (rho)
• Is the true underlying population correlation between two variables (X and Y).
• Is generally unknown and estimated by the sample correlation coefficient (r).
• The sample (Pearson) correlation coefficient (r):
  – is a number between −1 and +1, where
  – the absolute value of the coefficient measures the strength of the relationship:
    +1 = perfect positive linear relationship
    −1 = perfect negative linear relationship
  – the values r = 1 and r = −1 imply a perfect linear relationship between the variables
  – the correlation coefficient has no units of measurement
• If r > 0, then:
  – the variables are positively correlated
  – two variables (x, y) are positively correlated if as x increases y tends to increase, whereas as x decreases y tends to decrease
  – the variables increase together
[Figure: positive linear relationship between X and Y.]
• If r < 0, then:
  – the variables are negatively correlated
  – two variables (x, y) are negatively correlated if as x increases y tends to decrease, whereas as x decreases y tends to increase
  – an inverse relationship; example: pulse rate and age
[Figure: negatively correlated X and Y.]
• If r = 0, then:
  – the variables are uncorrelated
  – there is no linear relationship between the two variables (independent variables have r = 0)
  – example: birthweight and birthday
[Figure: uncorrelated X and Y variables.]
• Other names for r:
  – Pearson correlation coefficient
  – Product moment correlation coefficient
• Characteristics of r:
  – the value of r is independent of the units used to measure the variables
  – the value of r can be substantially influenced by a small fraction of outliers
• r is calculated as
  r = Σ(Xi − X̄)(Yi − Ȳ) / √[ Σ(Xi − X̄)² · Σ(Yi − Ȳ)² ]
• The easiest equivalent formula is
  r = [ ΣXY − (ΣX)(ΣY)/n ] / √{ [ΣX² − (ΣX)²/n] · [ΣY² − (ΣY)²/n] }

Simple Example
• We would calculate r from the formula above.
• Nearly perfect positive correlation!
Hypothesis Testing
• A test is available to determine whether the correlation coefficient is equal to zero (HO: ρ (population correlation) = 0).
• Like most other tests, the result is usually summarized by a p-value.
• It is a t-test.
• For example, r between under-five mortality rate and percentage of children immunized against DPT above was calculated to be r = −0.79, fairly close to −1.
• Testing HO: ρ = 0, we found p < 0.001.
• There appears to be a significant and strong linear relationship between under-five mortality rate and percentage of children immunized against DPT.
• The test statistic is t = r√(n − 2) / √(1 − r²), with n − 2 df.
• For the DPT and mortality rate data, t ≈ −5.5; for a t distribution with 18 df, p < 0.001.
• We reject H0 at α = 5% and conclude that the true population correlation is not equal to 0; it is < 0.
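This t-test is what scipy's pearsonr reports; a sketch on the 20-country data above:

# Sketch: Pearson correlation and its t-test p-value for the DPT data.
from scipy.stats import pearsonr

immunized = [77, 69, 32, 85, 94, 99, 89, 13, 95, 95,
             54, 89, 95, 87, 91, 98, 73, 47, 76, 90]
mortality = [118, 65, 184, 8, 43, 12, 55, 208, 7, 9,
             9, 124, 10, 6, 33, 16, 32, 145, 87, 9]

r, p = pearsonr(immunized, mortality)
print(r, p)     # ~ -0.79, p < 0.001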
Limitations of the correlation coefficient:
1. It quantifies only the strength of the linear relationship between two variables.
2. It is very sensitive to outlying values, and thus can sometimes be misleading.
3. It cannot be extrapolated beyond the observed ranges of the variables.
4. A high correlation does not imply a cause-and-effect relationship.
Spearman Rank Correlation Coefficient (rs)
It is a non-parametric measure of correlation. This procedure makes use of the two sets of ranks that may be assigned to the sample values of X and Y.
The Spearman rank correlation coefficient can be computed in the following cases:
 Both variables are quantitative but skewed.
 Both variables are qualitative ordinal.
 One variable is quantitative skewed and the other is qualitative ordinal.
Procedure:

1. Rank the values of X from 1 to n, where n is the number of pairs of values of X and Y in the sample.
2. Rank the values of Y from 1 to n.
3. Compute the value of di for each pair of observations by subtracting the rank of Yi from the rank of Xi: di = rank(Xi) − rank(Yi).
4. Square each di and compute Σdi², the sum of the squared values.
5. Apply the following formula:

   rs = 1 − 6 Σ(di)² / [ n(n² − 1) ]

The value of rs denotes the magnitude and nature of the association, with the same interpretation as simple r.


exercise

B. Linear Regression
• A regression is a description of a response measure, Y, the dependent variable, as a function of an explanatory variable, X, the independent variable.
• The goal is the prediction or estimation of the value of one variable, Y, based on the value of the other variable, X.
• Simple: one predictor variable (X) used to predict the response (Y).
• Multiple: two or more predictor variables (X's) used to predict the response (Y).
Simple Linear Regression (SLR)
• Like correlation, it explores the nature of the relationship between two continuous variables.
• The main difference is that regression looks at the change in one variable (the response, outcome or dependent variable) that corresponds to a given change in the other (the explanatory, predictor or independent variable).
• Other names:
  – Linear regression
  – Simple linear regression
  – Least squares regression
• The objective of SLR is to predict the value of the y variable from a linear relationship with the x variable.
• It models the relationship between a dependent variable (the variable to be predicted) and an independent variable (the predictor).
• Example: We are interested in the distribution of the birth weight of infants as gestational age increases.
• What will happen to birth weight as gestational age increases?
• We can quantify this relationship by postulating a model of the form
  E(y | x) = α + βx
• E(y | x) is the mean birth weight of infants whose gestational age is x weeks.
• The model represents a straight line.
• The parameters α and β are constants and are the coefficients of the equation:
  – α is the intercept (or constant) of the line, or the mean value of the response y when x is equal to 0
  – β is the slope (or gradient), or the change in y that corresponds to a one-unit change in x
• If  is positive, then the expected value
of y increases in magnitude as x increases

• If  is negative, the expected value of y


decreases as x increases
• Due to the natural variability in the data,
we accommodate the error and fit a model
of the form:

• Use the data from the sample to estimate


α and , the coefficients of the regression
line
Using the sample data:
• Where

• The estimators a and b serve as point


estimates of the regression parameters α
and 
• With a different sample, the estimates
would change
Hypothesis Testing
• We are also interested in whether β is significantly different from zero (HO: β = 0).
• If β = 0, it means that changes in x do not affect y.
• This is equivalent to saying that we cannot predict y from x, or that x does not explain y.
• This test of whether β is significantly different from zero uses the t test statistic.
• The result of fitting a SLR to under-five mortality rate (y) and percentage of children immunized against DPT (x) is:
  y = 219 – 2.1x
• The intercept is 219 and the slope is –2.1.
• The equation tells us that countries with no immunization at all [% of children immunized against DPT = 0] have an under-five mortality rate of 219 per 1000 live births on average.
• For every additional % of immunization, the under-five mortality rate decreases by 2.1 per 1000 live births.
• Testing HO: β = 0, we found p < 0.001.
• This tells us that immunization against DPT is a significant predictor of under-five mortality.
Multiple Linear Regression
• The outcome is modeled as a function of more than one variable x1, x2, …, xk by:
  Y = α + β1x1 + β2x2 + … + βkxk + ε
• estimated by
  y = a + b1x1 + b2x2 + … + bkxk
• where y is the outcome variable and x1, x2, …, xk are the values of k distinct explanatory variables.
. regress weight age order cid mid

Source SS df MS Number of obs = 4,390


F(4, 4385) = 47.06
Model 58785717.8 4 14696429.4 Prob > F = 0.0000
Residual 1.3694e+09 4,385 312297.002 R-squared = 0.0412
Adj R-squared = 0.0403
Total 1.4282e+09 4,389 325406.259 Root MSE = 558.84

weight Coef. Std. Err. t P>|t| [95% Conf. Interval]

age 29.45553 2.258264 13.04 0.000 25.0282 33.88287


order -32.00032 7.301999 -4.38 0.000 -46.31592 -17.68471
cid -.0831916 .0748598 -1.11 0.267 -.2299546 .0635713
mid .0013841 .0011838 1.17 0.242 -.0009368 .003705
_cons 2620.157 47.02924 55.71 0.000 2527.956 2712.358

.
F-distribution
• Used for comparing the variances of two populations.
• It is defined in terms of the ratio of the variances of two normally distributed populations, so it is sometimes also called the variance ratio.
• F-distribution: if the assumptions are met,
  F = (s1²/σ1²) / (s2²/σ2²)
  where s1² = Σ(x1 − x̄1)²/(n1 − 1) and s2² = Σ(x2 − x̄2)²/(n2 − 1).
Cont…
• Degrees of freedom: v1 = n1 − 1, v2 = n2 − 1.
• For different values of v1 and v2 we get different distributions, so v1 and v2 are the parameters of the F distribution.
• If σ1² = σ2², then the statistic F = s1²/s2² follows an F distribution with n1 − 1 and n2 − 1 degrees of freedom.


Properties of the F-distribution
• It is positively skewed, and its skewness decreases with increase in v1 and v2.
• The value of F must always be positive or zero, since variances are squared. So its value lies between 0 and ∞.
• The shape of the F-distribution depends upon the number of degrees of freedom.


Testing of hypothesis for equality of two variances
• It is based on the variances in two independently selected random samples drawn from two normal populations.
• Null hypothesis H0: σ1² = σ2²
• F = (s1²/σ1²)/(s2²/σ2²), which reduces to F = s1²/s2² under H0.
• Degrees of freedom: v1 (the numerator) and v2 (the denominator).
• Find the table value using v1 and v2.
• If the calculated F value exceeds the table F value, the hypothesis is rejected.
Example
• Borden et al. compared meniscal repair techniques using cadaveric knee specimens. One of the variables of interest was the load at failure (in newtons) for knees fixed with the FasT-FIX technique (group 1) and the vertical suture method (group 2). Each technique was applied to six specimens. The sd for the FasT-FIX method was 30.62, and the sd for the vertical suture method was 11.37. Can we conclude that, in general, the variance of load at failure is higher for the FasT-FIX technique than the vertical suture method? α = 0.05


Solution

• Data: n1 = 6, n2 = 6, s1 = 30.62, s2 = 11.37
• Assumption: each sample constitutes a simple random sample of a population of similar subjects. The samples are independent. We assume the loads at failure in both populations are approximately normally distributed.
• Hypotheses: Ho: σ1² ≤ σ2²
             Ha: σ1² > σ2²
• Test statistic: variance ratio = F = s1²/s2²
Cont…
Distribution of test statistic: when Ho is true, the test statistic is distributed as F with n1 − 1 numerator and n2 − 1 denominator degrees of freedom.
Decision rule: the critical value is 5.05 (from the table).
Calculated test statistic: VR = (30.62)²/(11.37)² = 7.25
Statistical decision: we reject Ho, since 7.25 > 5.05.
Conclusion: the failure-load variability is higher when using the FasT-FIX method than the vertical suture method.
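A sketch of this variance-ratio test in Python, assuming scipy:

# Sketch: one-sided F test for equality of two variances.
from scipy.stats import f

s1, s2 = 30.62, 11.37
n1, n2 = 6, 6

vr = s1**2 / s2**2                  # ~7.25
crit = f.ppf(0.95, n1 - 1, n2 - 1)  # ~5.05, the table critical value
p = f.sf(vr, n1 - 1, n2 - 1)        # one-sided p-value, ~0.02
print(vr, crit, p)                  # reject Ho since 7.25 > 5.05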
Analysis of Variance

Analysis of variance is a technique that partitions the total sum of squares of deviations of the observations about their mean into portions associated with independent variables in the experiment and a portion associated with error.

A factor refers to a categorical quantity under examination in an experiment as a possible cause of variation in the response variable.

Levels refer to the categories, measurements, or strata of a factor of interest in the experiment.
Types of Experimental Designs

Experimental Designs:
– Completely Randomized → One-Way ANOVA
– Randomized Block → Two-Way ANOVA
– Factorial
Completely Randomized Design
1. Experimental units (subjects) are assigned randomly to treatments.
   – Subjects are assumed homogeneous.
2. One factor or independent variable.
   – 2 or more treatment levels or groups.
3. Analyzed by one-way ANOVA.
One-Way ANOVA F-Test

1. Tests the equality of 2 or more (p) population means.
2. Variables:
   – one categorical independent variable
   – one continuous dependent variable
Assumptions

1. Randomness & independence of errors.
2. Normality: populations (for each condition) are normally distributed.
3. Homogeneity of variance: populations (for each condition) have equal variances.
Hypotheses

H0: μ1 = μ2 = μ3 = … = μp
 – All population means are equal
 – No treatment effect

Ha: Not all μj are equal
 – At least one population mean is different
 – Treatment effect
 – Note: Ha does not require μ1 ≠ μ2 ≠ … ≠ μp; it only requires μi ≠ μj for some i, j.
One-Way ANOVA Basic Idea
1. Compares 2 types of variation to test equality of means.
2. If treatment variation is significantly greater than random variation, then the means are not equal.
3. Variation measures are obtained by 'partitioning' the total variation.
One-Way ANOVA Partitions Total Variation

Total variation =
  variation due to treatment (Sum of Squares Among, Sum of Squares Between, Sum of Squares Treatment (SST), Among Groups Variation)
+ variation due to random sampling (Sum of Squares Within, Sum of Squares Error (SSE), Within Groups Variation)
The Model

xij = μ + Tj + eij

• μ represents the mean of all the k population means and is called the grand mean.
• Tj represents the difference between the mean of the jth population and the grand mean and is called the treatment effect.
• eij represents the amount by which an individual measurement differs from the mean of the population to which it belongs and is called the error term.
Total Variation
SS(Total) = (x11 − x̄)² + (x21 − x̄)² + … + (xij − x̄)²
[Figure: response values in Groups 1–3 plotted about the grand mean x̄.]

Treatment Variation
SST = n1(x̄1 − x̄)² + n2(x̄2 − x̄)² + … + np(x̄p − x̄)²
[Figure: group means x̄1, x̄2, x̄3 plotted about the grand mean x̄.]

Random (Error) Variation
SSE = (x11 − x̄1)² + (x21 − x̄1)² + … + (xpj − x̄p)²
[Figure: observations plotted about their own group means x̄1, x̄2, x̄3.]
One-Way ANOVA F-Test Statistic

1. Test statistic:
   F = MST / MSE = V.R.
   • MST = SST / (p − 1) is the Mean Square for Treatment
   • MSE = SSE / (n − p) is the Mean Square for Error
2. Degrees of freedom:
   • ν1 = p − 1
   • ν2 = n − p
   where p = number of treatment groups (levels) and n = total sample size.
One-Way ANOVA Summary Table

Source of    Degrees of   Sum of      Mean Square         F
Variation    Freedom      Squares     (Variance)
Treatment    p − 1        SST         MST = SST/(p−1)     MST/MSE
Error        n − p        SSE         MSE = SSE/(n−p)
Total        n − 1        SS(Total)
                          = SST + SSE
One-Way ANOVA F-Test Critical Value

If the means are equal, F = MST/MSE ≈ 1. Only reject for large F!

Reject H0 if F > Fα(p−1, n−p); otherwise do not reject H0. (The test is always one-tailed.)
HOW TO CALCULATE ANOVAs BY HAND…

Treatment 1   Treatment 2   Treatment 3   Treatment 4
y11           y21           y31           y41           n = 10 obs./group
y12           y22           y32           y42
…             …             …             …             k = 4 groups
y110          y210          y310          y410

The group means:
  ȳ1· = Σj y1j / 10,   ȳ2· = Σj y2j / 10,   ȳ3· = Σj y3j / 10,   ȳ4· = Σj y4j / 10

The (within) group variances:
  Σj (y1j − ȳ1·)² / (10 − 1), and similarly for groups 2, 3 and 4.
Sum of Squares Within (SSW), or Sum of Squares Error (SSE)

Adding the numerators of the four (within) group variances gives:

  Σj(y1j − ȳ1·)² + Σj(y2j − ȳ2·)² + Σj(y3j − ȳ3·)² + Σj(y4j − ȳ4·)²

  SSW = Σi Σj (yij − ȳi·)²   (or SSE, for chance error)
Sum of Squares Between (SSB), or Sum of Squares Regression (SSR)

Overall mean of all 40 observations ("grand mean"):
  ȳ·· = Σi Σj yij / 40

  SSB = 10 × Σi (ȳi· − ȳ··)²

SSB measures the variability of the group means compared to the grand mean (the variability due to the treatment).
Total Sum of Squares (TSS)

  TSS = Σi Σj (yij − ȳ··)²

The squared difference of every observation from the overall mean (the numerator of the variance of Y!).
Partitioning of Variance

  Σi Σj (yij − ȳi·)² + 10 × Σi (ȳi· − ȳ··)² = Σi Σj (yij − ȳ··)²

  SSW + SSB = TSS
ANOVA Table

Source of        d.f.    Sum of squares                   Mean Sum of       F-statistic      p-value
variation                                                 Squares
Between          k−1     SSB (sum of squared              SSB/(k−1)         [SSB/(k−1)] /    Go to
(k groups)               deviations of group means                          [SSW/(nk−k)]     F(k−1, nk−k) chart
                         from grand mean)
Within           nk−k    SSW (sum of squared              s² = SSW/(nk−k)
(n per group)            deviations of observations
                         from their group mean)
Total            nk−1    TSS (sum of squared deviations
variation                of observations from grand mean)

TSS = SSB + SSW


Example

Treatment 1   Treatment 2   Treatment 3   Treatment 4
60 inches        50            48            47
67               52            49            67
42               43            50            54
67               67            55            67
56               67            56            68
62               59            61            65
64               67            61            65
59               64            60            56
72               63            59            60
71               65            64            65
Example

Step 1) Calculate the sum of squares between groups:

Mean for group 1 = 62.0
Mean for group 2 = 59.7
Mean for group 3 = 56.3
Mean for group 4 = 61.4

Grand mean = 59.85

SSB = [(62 − 59.85)² + (59.7 − 59.85)² + (56.3 − 59.85)² + (61.4 − 59.85)²] × n per group
    = 19.65 × 10 = 196.5
Example

Step 2) Calculate the sum of squares within groups:

(60 − 62)² + (67 − 62)² + (42 − 62)² + (67 − 62)² + (56 − 62)² + (62 − 62)² + (64 − 62)² + (59 − 62)² + (72 − 62)² + (71 − 62)² + (50 − 59.7)² + (52 − 59.7)² + (43 − 59.7)² + (67 − 59.7)² + (67 − 59.7)² + (59 − 59.7)² + … (sum of 40 squared deviations) = 2060.6
Fill in the ANOVA table

Source of    d.f.   Sum of     Mean Sum of   F-statistic   p-value
variation           squares    Squares
Between       3      196.5       65.5           1.14        .344
Within       36     2060.6       57.2
Total        39     2257.1
Fill in the ANOVA table

Source of    d.f.   Sum of     Mean Sum of   F-statistic   p-value
variation           squares    Squares
Between       3      196.5       65.5           1.14        .344
Within       36     2060.6       57.2
Total        39     2257.1

INTERPRETATION of ANOVA:
How much of the variance in height is explained by treatment group?
R² = "Coefficient of Determination" = SSB/TSS = 196.5/2257.1 ≈ 9%
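The whole table can be reproduced with scipy's one-way ANOVA; a sketch using the data above:

# Sketch: one-way ANOVA for the four treatment groups.
from scipy.stats import f_oneway

g1 = [60, 67, 42, 67, 56, 62, 64, 59, 72, 71]
g2 = [50, 52, 43, 67, 67, 59, 67, 64, 63, 65]
g3 = [48, 49, 50, 55, 56, 61, 61, 60, 59, 64]
g4 = [47, 67, 54, 67, 68, 65, 65, 56, 60, 65]

F, p = f_oneway(g1, g2, g3, g4)
print(F, p)     # ~1.14 and ~0.344, matching the hand-built table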
Coefficient of Determination

  R² = SSB / (SSB + SSE) = SSB / SST

The amount of variation in the outcome variable (dependent variable) that is explained by the predictor (independent variable).
