
30280 Applications for Management

Lecture 4

ANOVA – Analysis of variance


Housekeeping announcements

• Next week: STATA lab session

• STATA available for installation

• Solutions to the exercises and lecture notes on Blackboard by Monday

Applications for management, 30280, BIEM 2020/2021, Lecture 1, slide 2


Recap: data sources and quality

The research process

1. Define the problem
2. Prepare the research design
3. Identify data sources
4. Collect the data
5. Process and analyse the data
   1. Screen the data and clean the data file if necessary
   2. Conduct analyses that address the research question
6. Draw conclusions and prepare the report

Important concepts include:

• Common data problems
  • Primary data
    • Issues in data collection
    • Interviewer effects
    • Errors in data
  • Secondary data
• Univariate data screening
  • Missing values
  • Impossible or improbable values
  • Outliers
  • Non-normal distributions
• Bivariate data screening
  • Impossible/improbable combinations of responses
  • Sample size within groups
  • Violations of assumptions of parametric inferential statistics

N. Cavalli Applications for management, 30280, BIEM 2020/2021, Lecture 1, slide 3


Data screening: can you answer
these questions?

• What situations can give rise to problems in data sets?
• What is the goal of data screening?
• What types of problems should you look for in the univariate distributions of categorical and metric variables?
• What assumptions about data are required for most parametric statistical analyses?
• What procedures can be used for univariate data screening?
• What procedures can be used for bivariate data screening?
• What remedies are available to deal with different problems that might be identified as the result of data screening?
• Outliers
  • What are outliers? Why do they create problems?
  • How does a box plot help identify outliers?
  • What can you do to remedy problems with outliers?
• Missing values
  • What are they?
  • Why might they occur?
  • Why might they create problems?
  • What options are available for dealing with missing values? What strengths and weaknesses does each option have?
• What information about treatment of variables after data screening should be included in a research report?


Today’s agenda: an overview of the
research process

1. Define the problem
2. Prepare the research design
3. Identify sources of data and, if necessary, a sampling plan
4. Collect the data
5. Process and analyse the data
   • Step 1: Screen the data and clean the data file if necessary
   • Step 2: Conduct analyses that address the research problem
6. Formulate conclusions and prepare a report


Today’s Agenda

• Understanding how to use the one-way analysis of variance to test for differences among the means of several groups

• Interpreting ANOVA output in STATA


Recall...
Hypothesis testing

• Main concepts
  • We make a claim about a population parameter
  • We define two alternatives that cover all possible population outcomes:
    • H0 (null hypothesis): a hypothesis about a population parameter that is considered true unless there is sufficient evidence to the contrary
    • H1 (alternative or research hypothesis): the hypothesis that is accepted if the null is rejected
  • We reject or do not reject the null hypothesis based on random samples from the population (every member of the population has the same probability of being selected)


Example
Evaluating the effect of an intervention
• Research question: Does a given training course increase employees’ productivity?
• Design:
  • Employees are randomly assigned to two groups: those attending and those not attending the course
  • Productivity is measured as the time taken to complete a given task
  • H0 (null hypothesis): the mean time for attendants is equal to the mean time for non-attendants
  • H1 (alternative hypothesis): the mean time for attendants is lower than the mean time for non-attendants


Example
Evaluating the effect of an intervention

• Statistical technique: two-sample t-test

• Run the test and read the results:
  • One-tailed t-test: if the mean time in the group of attendants is significantly smaller than the mean time in the group of non-attendants, we can claim that the research hypothesis is supported by empirical evidence
  • Two-tailed t-test: if the mean times in the two groups are significantly different, we can state that the intervention (training course) has an impact on productivity


When to use ANOVA

• ANOVA (Analysis of variance) is used to investigate the impact that categorical (nominal) variables have on a numerical variable. The categorical variables are referred to as factors, and their values as levels.

• If there is only one categorical variable, the analysis is referred to as one-way ANOVA. If there are n factors, the analysis is called n-way ANOVA
  • One-way example: expected earnings in the first job by sex (variable ‘sex’ is the ‘factor’, while ‘male’ and ‘female’ are the levels)
  • Two-way example: expected earnings in the first job by sex and Applications for Management course year (2020 vs 2019)


One-way ANOVA
Intuition

• If the factor takes only two values, the problem can be addressed by comparing the means of the two samples drawn from the two groups
• But when more than two levels are observed, we cannot rely on a single comparison of two sample means
• The idea is to work in terms of the variability in the data


One-way ANOVA
Definitions

• Let Y be a numerical variable and let X be a categorical variable taking c values (1 factor, c levels)

• According to the values taken by X, the units of the population are grouped into c sub-populations

• The probabilistic behaviour of Y in each of the c sub-populations is described by a probability distribution

• Evaluating the impact of X on Y is the same as comparing these c distributions


One-way ANOVA
Assumptions

• Y is assumed to follow a normal distribution in each of the c sub-populations
• Y is assumed to have the same variance in the c sub-populations. Let us denote it by σ²

Under these assumptions, the comparison of the c distributions reduces to the comparison of the means of Y in the c sub-populations.

Let us denote the mean of Y in sub-population i by μi, i = 1, …, c.

One-way ANOVA
Setting up the hypothesis
Evaluating the impact that a factor X has on a numerical variable Y reduces to the following hypothesis testing problem: is the mean of Y the same in the c sub-populations?

H0: μ1 = μ2 = … = μc

against the alternative hypothesis:

H1: Not all means are the same: at least one is different from the others.


Example: packaging vs sales

Are the sales of a product associated with the type of package design?

Y = Total sales of a product

X = Design of the package, 4 levels, labelled 1, 2, 3, 4.

ANOVA question: do we have enough empirical evidence to reject the hypothesis that the mean sales of the product are the same regardless of which of the 4 package designs is used?

Example: packaging vs sales
Hypothesis definition

• First, define the null and alternative hypotheses:
  • H0: μ1 = μ2 = … = μc
  • H1: Not all μj are equal


One-way ANOVA
Setting up the hypothesis (cont'd)

H0: μ1 = μ2 = μ3 = … = μc

Illustration with three groups: μ1 = μ2 = μ3

H0: all means are the same. The null hypothesis is true (no group effect).

One-way ANOVA
Setting up the hypothesis (cont'd)

H1: Not all μj are the same

H1: at least one mean is different. The null hypothesis is NOT true.


One-way ANOVA
Intuition: the role of variability
• The variability of the data is a key factor in testing the equality of means

First graph: variability around the sample means within groups is small relative to the variability among the sample means across groups.

Second graph: variability around the sample means within groups is large relative to the variability among the sample means across groups.

In both cases the means may look different, but the large variation within groups in the second graph weakens the evidence that the means differ.

One-way ANOVA
Intuition: the role of variability

• From the previous graph we get the intuition that, to find significant differences among groups, not only the average level of the outcome matters but also the variation within groups.
• The higher this within-group variability, the weaker the evidence that the group means differ (and thus the less likely we are to find significant differences)


One-way ANOVA
Intuition: the role of variability

• With ANOVA, we test for differences driven by one factor only, but many other factors can potentially drive the observed variability!
• For example, variability in sales might be driven by the type of store, the location of the store, the location of the product, the music that is played in the store, and also packaging


One-way ANOVA
Intuition: the role of variability
• BUT we are interested only in the last source of variability (packaging) and, therefore, we might think of splitting the total variability into two parts: one depending on the factor being considered and one due to all other factors.

• Comparing the variation due to the factor of interest with the variation due to all the other factors is the intuition behind ANOVA!


Back to the package example
Option 1: variability due to packaging



Back to the package example
Option 2: no impact of the factor



Back to the package example
The observed sample



Partitioning the variation
Total sum of squares
• The total sum of squares (total variation) can be split into two parts:

SST = SSW + SSG

• SST = Total Sum of Squares
  Total variation = dispersion of the observations around the overall mean
• SSW = Sum of Squares Within Groups
  Within-group variation = dispersion of the observations around their group-specific means
• SSG = Sum of Squares Between Groups
  Between-group variation = dispersion of the group sample means around the overall mean


Recall...
Definition of variance
• Since ANOVA is the “analysis of variance”, let’s revisit the concept of sample variance
• Variance is the “average squared deviation (or difference) of data points from the mean”
  • Compute the mean
  • Take the distance of each data point from the mean
  • Square each distance (squared differences)
  • Add these squares together (total sum of squares)
  • Find the average squared distance (divide by the degrees of freedom: N − 1 for a sample, N for a population)
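The five steps above can be followed literally in a few lines of Python. This is only an illustrative sketch: the data are the Brand 1 durations from the batteries example in this lecture, and the variable names are chosen here for readability.

```python
import statistics

# Durations for Brand 1, taken from the batteries example in this lecture.
data = [254, 263, 241, 237, 251]

n = len(data)
mean = sum(data) / n                          # step 1: compute the mean
deviations = [y - mean for y in data]         # step 2: distance from the mean
squared = [d ** 2 for d in deviations]        # step 3: square each distance
sum_of_squares = sum(squared)                 # step 4: add the squares together
sample_variance = sum_of_squares / (n - 1)    # step 5: divide by N - 1 (sample)
population_variance = sum_of_squares / n      # ... or by N (population)

print(f"mean = {mean:.1f}, sum of squares = {sum_of_squares:.1f}")
print(f"sample variance = {sample_variance:.2f}")
print(f"population variance = {population_variance:.2f}")

# Cross-check the hand computation against the standard library.
assert abs(sample_variance - statistics.variance(data)) < 1e-9
```

The sum of squares computed in step 4 is the same kind of quantity that ANOVA splits into a within-group and a between-group part.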


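Partitioning the variation
Within- & between-variability (SSW & SSG)

\[
SSW = (Y_{11} - \bar{Y}_1)^2 + (Y_{21} - \bar{Y}_1)^2 + \dots + (Y_{n_c c} - \bar{Y}_c)^2
\]
\[
SSG = n_1 (\bar{Y}_1 - \bar{Y})^2 + n_2 (\bar{Y}_2 - \bar{Y})^2 + \dots + n_c (\bar{Y}_c - \bar{Y})^2
\]

Here Y_{ij} is observation i in group j, \bar{Y}_j is the sample mean of group j, \bar{Y} is the overall mean and n_j is the size of group j.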
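As a rough check of the partition SST = SSW + SSG, the sketch below computes the three sums of squares directly from their definitions. The function name `sums_of_squares` is made up for this illustration, and the data are the battery durations from the batteries example in this lecture.

```python
def sums_of_squares(groups):
    """Return (SST, SSW, SSG) for a list of groups of observations."""
    all_values = [y for group in groups for y in group]
    grand_mean = sum(all_values) / len(all_values)
    group_means = [sum(group) / len(group) for group in groups]

    # Total variation: every observation against the overall mean.
    sst = sum((y - grand_mean) ** 2 for y in all_values)
    # Within-group variation: every observation against its own group mean.
    ssw = sum((y - m) ** 2
              for group, m in zip(groups, group_means)
              for y in group)
    # Between-group variation: group means against the overall mean,
    # weighted by group size.
    ssg = sum(len(group) * (m - grand_mean) ** 2
              for group, m in zip(groups, group_means))
    return sst, ssw, ssg


batteries = [
    [254, 263, 241, 237, 251],  # Brand 1
    [234, 218, 235, 227, 216],  # Brand 2
    [200, 222, 197, 206, 204],  # Brand 3
]
sst, ssw, ssg = sums_of_squares(batteries)
print(f"SST = {sst:.1f}, SSW = {ssw:.1f}, SSG = {ssg:.1f}")

# The partition holds up to floating-point rounding.
assert abs(sst - (ssw + ssg)) < 1e-6
```

The function accepts groups of different sizes, since each group mean is weighted by its own n_j.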
One-way ANOVA table



One-way ANOVA
The F-statistic

F = MSG / MSW


One-way ANOVA
The F-statistic

H0: μ1 = μ2 = … = μc
H1: At least two population means are different
• Test statistic: F = MSG / MSW
  • MSG is the mean square between groups (“between variance”)
  • MSW is the mean square within groups (“within variance”)
• Degrees of freedom
  • df1 = c − 1 (c = number of groups)
  • df2 = N − c (N = sum of all sample sizes)
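Written out in terms of the sums of squares, and using the same symbols as the slides (c groups, N observations in total), the test statistic can be sketched as follows. Each mean square is simply a sum of squares divided by its degrees of freedom.

```latex
\[
MSG = \frac{SSG}{c-1}, \qquad
MSW = \frac{SSW}{N-c}, \qquad
F = \frac{MSG}{MSW} = \frac{SSG/(c-1)}{SSW/(N-c)}
\]
```

Under H0, and under the normality and equal-variance assumptions stated earlier, F follows an F distribution with c − 1 and N − c degrees of freedom.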


Recall...
The F distribution
• Fisher-Snedecor distribution
• Used to study the ratio of two variances (e.g. F = MSG/MSW)
• Two parameters for the degrees of freedom: one related to the numerator, one to the denominator


One-way ANOVA
Hypothesis testing

• The F statistic is the ratio of the between variance to the within variance
  • The ratio can never be negative
  • df1 = c − 1 will typically be small
  • df2 = N − c will typically be large

Decision rule:
Reject H0 if F > FU, where FU is the upper-tail critical value of the F distribution with df1 and df2 degrees of freedom at significance level α; otherwise do not reject H0


One-way ANOVA
Hypothesis testing



Back to the package example
Performing ANOVA

Decision: reject H0. There is enough evidence to conclude that at least one mean differs. But which one?


The Scheffe post-hoc test

• Tells which population means are significantly different
  • e.g.: μ1 = μ2 ≠ μ3

• Done after rejection of equal means in ANOVA
  • Allows pairwise comparisons
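To make the pairwise idea concrete, here is a minimal sketch of Scheffe's criterion for pairwise comparisons, applied to the battery durations from the batteries example in this lecture. It is not a reproduction of STATA's output. A pair of means is flagged as different when the squared difference divided by MSW(1/ni + 1/nj) exceeds (c − 1) times the critical F value. The critical value is obtained from a closed form that is valid only when df1 = 2, which happens to be the case with three groups; for other designs it would have to come from an F-table or from software.

```python
from itertools import combinations

groups = {
    "Brand 1": [254, 263, 241, 237, 251],
    "Brand 2": [234, 218, 235, 227, 216],
    "Brand 3": [200, 222, 197, 206, 204],
}
alpha = 0.05

c = len(groups)
N = sum(len(v) for v in groups.values())
means = {name: sum(v) / len(v) for name, v in groups.items()}
ssw = sum((y - means[name]) ** 2 for name, v in groups.items() for y in v)
msw = ssw / (N - c)

# Critical value of F(2, df2): closed form that only holds for df1 = 2.
df2 = N - c
assert c - 1 == 2, "closed form below is valid only for three groups"
f_crit = (df2 / 2) * (alpha ** (-2 / df2) - 1)
threshold = (c - 1) * f_crit  # Scheffe's critical value

results = {}
for a, b in combinations(groups, 2):
    diff = means[a] - means[b]
    scheffe_f = diff ** 2 / (msw * (1 / len(groups[a]) + 1 / len(groups[b])))
    results[(a, b)] = (diff, scheffe_f, scheffe_f > threshold)

for (a, b), (diff, scheffe_f, significant) in results.items():
    print(f"{a} vs {b}: difference = {diff:.1f}, "
          f"statistic = {scheffe_f:.2f}, significant = {significant}")
```

With these data every pairwise statistic exceeds the threshold, so under this criterion all three brands differ from one another at the .05 level.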


Back to the package example
The Scheffe post-hoc test



Back to the package example
Cool packaging



Batteries example

You want to see if three different brands of batteries yield different durations. You obtain 5 measurements for each brand. At the .05 significance level, is there a difference in mean durations?

Brand 1   Brand 2   Brand 3
254       234       200
263       218       222
241       235       197
237       227       206
251       216       204

Data are on Blackboard (“batteries”)

Batteries example
Scatter diagram

[Scatter diagram: Duration (vertical axis) by Brand (horizontal axis)]

Group means: ȳ1 = 249.2, ȳ2 = 226.0, ȳ3 = 205.8
Overall mean: ȳ = 227.0


Batteries example
Computations

Brand 1   Brand 2   Brand 3
254       234       200
263       218       222
241       235       197
237       227       206
251       216       204

n1 = 5, n2 = 5, n3 = 5
N = 15
c = 3

SSG = 5 (249.2 − 227)² + 5 (226 − 227)² + 5 (205.8 − 227)² = 4716.4
SSW = (254 − 249.2)² + (263 − 249.2)² + … + (204 − 205.8)² = 1119.6

MSG = 4716.4 / (3 − 1) = 2358.2
MSW = 1119.6 / (15 − 3) = 93.3
F = MSG / MSW = 2358.2 / 93.3 = 25.275
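The hand computations for the batteries data can be retraced step by step in Python. This is just a sketch mirroring the slide, with variable names chosen for this illustration (`mean_sq_between` and `mean_sq_within` stand for MSG and MSW).

```python
batteries = {
    "Brand 1": [254, 263, 241, 237, 251],
    "Brand 2": [234, 218, 235, 227, 216],
    "Brand 3": [200, 222, 197, 206, 204],
}

c = len(batteries)                                  # number of groups
N = sum(len(v) for v in batteries.values())         # total sample size
grand_mean = sum(sum(v) for v in batteries.values()) / N
means = {name: sum(v) / len(v) for name, v in batteries.items()}

# Between-group and within-group sums of squares.
ssg = sum(len(v) * (means[name] - grand_mean) ** 2
          for name, v in batteries.items())
ssw = sum((y - means[name]) ** 2
          for name, v in batteries.items()
          for y in v)

# Mean squares: each sum of squares divided by its degrees of freedom.
mean_sq_between = ssg / (c - 1)
mean_sq_within = ssw / (N - c)
f_stat = mean_sq_between / mean_sq_within

print(f"SSG = {ssg:.1f}, SSW = {ssw:.1f}")
print(f"MSG = {mean_sq_between:.1f}, MSW = {mean_sq_within:.1f}")
print(f"F = {f_stat:.3f}")
```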
Batteries example
F-table



Batteries example
Solution
H0: μ1 = μ2 = μ3
H1: μj not all equal
α = .05
df1 = c − 1 = 2
df2 = N − c = 12

Critical value: F2,12,.05 = 3.89

Test statistic: F = MSG / MSW = 2358.2 / 93.3 = 25.275

Decision: reject H0 at α = 0.05, since F = 25.275 > F2,12,.05 = 3.89

Conclusion: there is evidence that at least one μi differs from the rest
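The critical value is normally read from an F-table or obtained from software. As an aside, when the numerator degrees of freedom equal 2, as in this example, the upper tail of the F distribution has a simple closed form, P(F > x) = (1 + 2x/df2)^(−df2/2), so the decision rule can be sketched with the standard library alone. The helper names below are made up for this illustration, and the shortcut does not apply when df1 is not 2.

```python
def f_upper_tail_df1_2(x, df2):
    """P(F > x) for an F distribution with 2 and df2 degrees of freedom."""
    return (1 + 2 * x / df2) ** (-df2 / 2)


def f_critical_df1_2(alpha, df2):
    """Upper-tail critical value of F(2, df2) at significance level alpha."""
    return (df2 / 2) * (alpha ** (-2 / df2) - 1)


alpha = 0.05
df2 = 12                       # N - c = 15 - 3
f_stat = 2358.2 / 93.3         # MSG / MSW from the batteries computations

f_crit = f_critical_df1_2(alpha, df2)
p_value = f_upper_tail_df1_2(f_stat, df2)
reject_h0 = f_stat > f_crit

print(f"critical value = {f_crit:.2f}")
print(f"F = {f_stat:.3f}, p-value = {p_value:.6f}")
print("reject H0" if reject_h0 else "do not reject H0")
```

Comparing F with the critical value and comparing the p-value with α are two equivalent ways of stating the same decision.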
Batteries example
STATA implementation

The p-value is very low, so we reject H0: there is evidence that at least one μi differs from the rest (the batteries of at least one brand have a duration different from those of the other brands).

Batteries example
STATA implementation (with Scheffe)



Golfing example
Now it is your turn..
You want to see if three different golf clubs yield different distances. You randomly select five measurements from trials on an automated driving machine for each club. At the .05 significance level, is there a difference in mean distance?

Club 1   Club 2   Club 3
254      234      200
263      218      222
241      235      197
235      227      206
251      216      204

Golfing example
Hypotheses

Overall Model
• H0: μCLUB1 = μCLUB2 = μCLUB3

• H1: at least one mean differs


Golfing example
Descriptive statistics

Remember to check for normality!



Golfing example
One-way ANOVA in STATA

Source           df      MS             F
Between groups   c − 1   SSG/(c − 1)    MSG/MSW
Within groups    N − c   SSW/(N − c)


Golfing example
One-way ANOVA in STATA



Golfing example
Conclusion

• There is enough evidence to conclude that the mean hitting distances (in metres) of the three golf clubs are not all equal.

• Refer to the sample means:
  • Club 1: 248.8
  • Club 2: 226
  • Club 3: 205.8

• Based on the sample means, Club 1 hits the longest distance on average, followed by Club 2, then Club 3. Which pairs differ significantly can be checked with a post-hoc test.
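One way to check the exercise by hand is sketched below. The function `one_way_anova` is a hypothetical helper written for this illustration, the numbers are computed from the five distances listed for each club, and the critical value 3.89 is the tabulated F2,12,.05 quoted in the batteries example (the degrees of freedom are the same here). It is not a reproduction of the STATA output.

```python
def one_way_anova(groups):
    """Return (F, df1, df2) for a dict mapping group name to observations."""
    c = len(groups)
    n_total = sum(len(v) for v in groups.values())
    grand_mean = sum(sum(v) for v in groups.values()) / n_total
    ssg = 0.0
    ssw = 0.0
    for values in groups.values():
        group_mean = sum(values) / len(values)
        ssg += len(values) * (group_mean - grand_mean) ** 2
        ssw += sum((y - group_mean) ** 2 for y in values)
    df1, df2 = c - 1, n_total - c
    return (ssg / df1) / (ssw / df2), df1, df2


clubs = {
    "Club 1": [254, 263, 241, 235, 251],
    "Club 2": [234, 218, 235, 227, 216],
    "Club 3": [200, 222, 197, 206, 204],
}
club_means = {name: sum(v) / len(v) for name, v in clubs.items()}
f_stat, df1, df2 = one_way_anova(clubs)

F_CRIT_2_12_05 = 3.89  # tabulated value quoted in the batteries example
print(club_means)
print(f"F({df1}, {df2}) = {f_stat:.2f}")
print("reject H0" if f_stat > F_CRIT_2_12_05 else "do not reject H0")
```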
References

• Review of hypothesis testing (NCT, 8th edition, chapter 9, «Hypothesis Tests of a Single Population»)
• ANOVA (NCT, 8th edition, chapter 15, «Analysis of Variance»)
• Lecture notes and slides
• ANOVA tutorial on STATA: https://www.youtube.com/watch?v=XEFGGkFRdD4&feature=plcp
