z = (x − x̄) / s
x̄ = mean of the sample
s = standard deviation of the sample
Original distribution will be transformed to one in which
the mean becomes 0 and
the standard deviation becomes 1
A z-score quantifies the original score in terms of
the number of standard deviations that the score is
from the mean of the distribution.
=> Use z-scores to filter outliers
Analyze => Descriptive Statistics => Descriptives...
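Outside the SPSS Descriptives dialog, the same standardization and outlier rule can be sketched in a few lines (a minimal NumPy sketch; the data values and the ±2 threshold are illustrative, matching the rule of thumb used later in the "Data trimming" section):

```python
import numpy as np

def z_scores(x):
    """Transform x to mean 0 and standard deviation 1."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std(ddof=1)  # sample standard deviation

def flag_outliers(x, threshold=2.0):
    """Flag values whose |z| exceeds the threshold (2 = outlier, 3 = extreme)."""
    return np.abs(z_scores(x)) > threshold

# Made-up sample: the last value clearly stands apart from the rest
data = [4.0, 5.0, 5.5, 6.0, 6.5, 7.0, 30.0]
flags = flag_outliers(data)
```

After the transformation the z-scores themselves have mean 0 and standard deviation 1, which is exactly the property stated above.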
Slide 22
Logarithmic transformation
Works for data that are skewed right.
Works for data where residuals get bigger for bigger values of the dependent variable.
Such trends in the residuals occur often, if the error in the value of an
outcome variable is a percent of the value rather than an absolute value.
For the same percent error, a bigger value of the variable means a bigger absolute error,
so residuals are bigger too.
Taking logs "pulls in" the residuals for the bigger values.
log(Y*error) = log(Y) + log(error)
Transformation rule
f(x) = log(x),     x ≥ 1
f(x) = log(x + 1), x ≥ 0
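The rule above can be sketched directly (a minimal NumPy sketch; the choice of base-10 logs and the example values are illustrative):

```python
import numpy as np

def log_transform(x):
    """Apply log10(x) when all values are >= 1, else log10(x + 1) (defined for x >= 0)."""
    x = np.asarray(x, dtype=float)
    if x.min() >= 1:
        return np.log10(x)
    return np.log10(x + 1)

# Right-skewed data: multiplicative spacing becomes additive on the log scale
y = np.array([1.0, 2.0, 4.0, 8.0, 16.0, 128.0])
y_log = log_transform(y)
```

Note how log10 maps 1, 10, 100 onto the equally spaced values 0, 1, 2, which is exactly the "pulling in" of the upper tail described above.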
[Scatterplot: body size (in cm, 150-200) vs. weight (in kg, 40-100)]
Example: Body size against weight
Slide 23
Logarithmic transformation I
Symmetry
A logarithmic transformation reduces positive skewness because it compresses the upper tail of the distribution while stretching out the lower tail. This is because the distances between 0.1 and 1, 1 and 10, 10 and 100, and 100 and 1000 are all the same on the logarithmic scale.
This is illustrated by the histogram of data simulated with salary (hourly wages) in a sample of nurses*. On the original scale, the data are long-tailed to the right, but after a logarithmic transformation is applied, the distribution is symmetric. The lines between the two histograms connect original values with their logarithms to demonstrate the compression of the upper tail and the stretching of the lower tail.
*More to come in chapter "ANOVA".
[Histograms of the original data and of the transformed data]
Slide 24
Logarithmic transformation II
[Histogram of original data (skewed right) => transformation y = log10(x) => histogram of transformed data (nearly normally distributed)]
Slide 25
Summary: Data transformation
Linear transformation and logarithmic transformation as discussed above.
Other transformations
Root functions
f(x) = x^(1/2), x^(1/3); x > 0
usable for right-skewed distributions
Hyperbola function
f(x) = 1/x; x > 1
usable for right-skewed distributions
Box-Cox transformation
f(x) = x^λ; λ > 1
usable for left-skewed distributions
Probit & logit functions (cf. logistic regression)
f(p) = ln(p / (1 − p)); p ∈ (0, 1)
usable for proportions and percentages
Interpretation and usage
Interpretation is not always easy.
Transformation can influence results significantly.
Look at your data and decide if it makes sense in the context of your study.
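Two of the transformations above can be sketched in a few lines (a minimal Python sketch; the example values are illustrative):

```python
import math

def logit(p):
    """f(p) = ln(p / (1 - p)), defined for 0 < p < 1."""
    return math.log(p / (1 - p))

def sqrt_transform(x):
    """Root function f(x) = x**(1/2), for right-skewed data with x > 0."""
    return math.sqrt(x)

mid = logit(0.5)            # the midpoint p = 0.5 maps to 0
root = sqrt_transform(16)   # large values are pulled in
```

The logit is symmetric around p = 0.5 (logit(0.75) = −logit(0.25)), which is why it is suited for proportions near the boundaries of [0, 1].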
Slide 26
Data trimming
Data trimming deals with
Finding outliers and extremes in a data set.
Dealing with outliers: Correction, deletion, discussion, robust estimation
Dealing with missing values: Correction, treatment (SPSS), (also imputation)
Transforming data if necessary (see chapter above).
Finding outliers and extremes
Get an overview of the dataset!
What does the distribution look like?
Are there any unexpected values?
Methods
Use basic statistics: <Analyze> with <Frequencies>, <Explore> and <Descriptives>
Outliers => e.g. z-scores above/below 2 standard deviations; extremes => above/below 3 standard deviations
Use graphical techniques: histogram, boxplot, Q-Q plot, ...
Outliers => e.g. as indicated in boxplot
Slide 27
Boxplot
A boxplot displays the center (median), the spread, and the outliers of a distribution.
See the exercise for more details about whiskers, outliers, etc.
[Boxplot of income (60.0-140.0): the "box" identifies the middle 50% of the dataset, with the median line inside, whiskers on both sides, and outliers labeled with their case numbers (196, 88, 83, 92)]
Boxplots are an excellent tool for detecting
and illustrating location and variation
changes between different groups of data.
[Boxplots of income by education (educ = 2-7); outliers labeled with case numbers 196, 191, 83, 65, 168, 88, 190, 92]
Slide 28
Boxplot and error bars
Boxplot: keyword "median"
Overview and illustration of the data distribution (range, skewness, outliers)
Error bars: keyword "mean"
Overview of the mean and the confidence interval or standard error
[Left: boxplots of income by educ (2-7) with labeled outliers; right: error bars of the 95% CI of income by educ (74.0-92.0)]
Slide 29
QQ plot
The quantile-quantile (q-q) plot is a graphical technique for deciding whether two samples come from populations with the same distribution.
Quantile: the fraction (or percent) of data points below a given value.
For example, the 0.5 (or 50%) quantile is the position below which 50% of the data fall and above which 50% fall. In fact, the 50% quantile is the median.
[Illustration: the 50% quantile marked on a sample distribution (simulated data) and on a normal distribution]
Slide 30
In the q-q plot, the quantiles of the first sample are plotted against the quantiles of the second sample.
If the two sets come from populations with the same distribution, the points should fall approximately along a 45-degree reference line.
The greater the displacement from this reference line, the greater the evidence that the two data sets come from populations with different distributions.
Some advantages of the q-q plot are:
The sample sizes do not need to be equal.
Many distributional aspects can be simultaneously tested.
Difference between QQ plot and PP plot
A q-q plot is better for assessing the goodness of fit in the tails of the distributions.
The normal q-q plot is more sensitive to deviations from normality in the tails of the distribution,
whereas the normal p-p plot is more sensitive to deviations near the mean of the distribution.
Q-Q plot: plots the quantiles of a variable's distribution against the quantiles of any of a number of test distributions.
P-P plot: plots a variable's cumulative proportions against the cumulative proportions of any of a number of test distributions.
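The points of a q-q plot can be computed directly (a minimal NumPy sketch with simulated data; the sample sizes and parameters are illustrative and deliberately unequal, since equal sizes are not required):

```python
import numpy as np

def qq_points(sample_a, sample_b, n_quantiles=99):
    """Return paired quantiles of two samples (the points of a q-q plot)."""
    probs = np.linspace(0.01, 0.99, n_quantiles)
    return np.quantile(sample_a, probs), np.quantile(sample_b, probs)

rng = np.random.default_rng(0)
a = rng.normal(0, 1, 500)
b = rng.normal(5, 2, 800)   # same distributional shape, different location/scale

qa, qb = qq_points(a, b)
# Same shape => the q-q points fall close to a straight line:
slope, intercept = np.polyfit(qa, qb, 1)
```

Here the fitted line recovers roughly the scale (≈ 2) and location shift (≈ 5) between the two normal samples; a curved point pattern would instead indicate different distribution shapes.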
Slide 31
Quantiles of the first sample are set against the quantiles of the second sample.
[Q-q plots: the standard normal distribution plotted against the sample distribution (simulated data) and against the normal distribution]
Slide 32
Example of q-q plot with simulated data
[Left panel ("Normal vs. Standard Normal"): histogram of SP1_s (Mean = 6.01, Std. Dev. = 0.99, N = 4,066) and its normal q-q plot against the standard normal test distribution (SPSS). Right panel ("Sample Distribution vs. Standard Normal"): histogram of SP2_s (Mean = 4.97, Std. Dev. = 2.21, N = 4,066) and its normal q-q plot. The SPSS output labels are German: "Häufigkeit" = frequency, "Beobachteter Wert" = observed value, "Erwarteter Wert von Normal" = expected normal value, "QQ-Diagramm von Normal" = q-q plot of normal.]
Slide 33
Example
Dataset "Data_07.sav" (Chernobyl fallout of radioactivity)
Distribution of original data Distribution of log transformed data
Slide 34
Exercises 01: Log Transformation & Data Trimming
Resources => www.schwarzpartners.ch/ZNZ_2012 => Exercises Analysis => Exercise 01
Slide 35
Linear Regression
Example
Medical research: Dependence of age and systolic blood pressure
[Scatterplot: age (years, 35-90) vs. systolic blood pressure (mmHg, 140-240)]
Dataset (EXAMPLE01.SAV)
Sample of n = 10 men
Variables for
age (age)
systolic blood pressure (pressure)
Typical questions
Is there a linear relation between
age and systolic blood pressure?
What is the predicted mean blood
pressure for men aged 67?
Slide 36
The questions
Question in everyday language:
Is there a linear relation between age and systolic blood pressure?
Research question:
What is the relation between age and systolic blood pressure?
What kind of model is best for showing the relation? Is regression analysis the right model?
Statistical question:
Forming hypotheses
H0: "No model" (= no overall model and no significant coefficients)
HA: "Model" (= overall model and significant coefficients)
Can we reject H0?
The solution
Linear regression equation of systolic blood pressure on age
pressure = β0 + β1 · age + u
pressure = dependent variable
age = independent variable
β0, β1 = coefficients
u = error term
Slide 37
"Howto" in SPSS
Scales
Dependent variable: metric
Independent variable: metric
SPSS
Analyze => Regression => Linear...
Result
Significant linear model
Significant coefficient
pressure = 135.2 + 0.956 · age
Predicted mean blood pressure
199.2 = 135.2 + 0.956 · 67
Typical statistical statement in a paper:
There is a linear relation between age and systolic blood pressure.
(Regression: F = 102.763, R² = .93, p = .000).
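The fitted line and the prediction for age 67 can be reproduced in a few lines (a minimal NumPy sketch; the coefficients 135.2 and 0.956 come from the SPSS output above, while the (age, pressure) pairs below are made up for illustration, not the EXAMPLE01.SAV values):

```python
import numpy as np

# Coefficients reported in the SPSS output on this slide
b0, b1 = 135.2, 0.956

def predicted_pressure(age):
    """Predicted mean systolic blood pressure [mmHg] for a given age [years]."""
    return b0 + b1 * age

pred_67 = predicted_pressure(67)   # 135.2 + 0.956 * 67 = 199.252 (slide: 199.2)

# The OLS estimation itself can be sketched with np.polyfit on illustrative data
age = np.array([38.0, 45.0, 52.0, 60.0, 68.0, 75.0, 83.0])
pressure = b0 + b1 * age + np.array([2.0, -1.5, 0.5, -2.0, 1.0, -0.5, 0.5])
b1_hat, b0_hat = np.polyfit(age, pressure, 1)   # slope and intercept estimates
```

Because the simulated errors are small, the estimated slope and intercept land close to the assumed 0.956 and 135.2.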
[Scatterplot of age (years) vs. systolic blood pressure (mmHg) with the fitted regression line]
Slide 38
General purpose of regression
Cause analysis
Is there a relationship between the independent variable and the dependent variable?
Example
Is there a model that describes the dependence between blood pressure and
age, or do these two variables just form a random pattern?
Impact analysis
Assess the impact of a change in the independent variable on the value of the dependent variable.
Example
If age increases, blood pressure also increases:
How strong is the impact? By how much will pressure increase with each additional year?
Prediction
Predict the values of a dependent variable using new values for the independent variable.
Example
Which is the predicted mean systolic blood pressure of men aged 67?
Slide 39
Key Steps in Regression Analysis
1. Formulation of the model
Common sense (remember the example with storks and babies)
Linearity of relationship plausible
Not too many variables (Principle of parsimony: Simplest solution to a problem)
2. Estimation of the model
Estimation of the model by means of OLS estimation (ordinary least squares)
Decision on procedure: Enter, stepwise regression
3. Verification of the model
Is the model as a whole significant? (i.e. are the coefficients significant as a group?)
F-test
Are the regression coefficients significant?
t-tests (should be performed only if the F-test is significant)
How much variation does the regression equation explain?
Coefficient of determination (adjusted R-squared)
4. Considering other aspects (for example multicollinearity)
5. Testing of assumptions (Gauss-Markov, independence and normal distribution)
6. Interpretation of the model and reporting
Slide 40
Regression model
Linear model
The linear model describes y as a function of x:
y = β0 + β1 · x   (equation of a straight line)
The variable y is a linear function of the variable x.
β0 (intercept)
The point where the line crosses the y-axis; the value of the dependent variable when all of the independent variables = 0.
β1 (slope)
The increase in the dependent variable per unit change in the independent variable (also known as the "rise over the run"): β1 = Δy / Δx.
Stochastic model
y = β0 + β1 · x + u
The error term u comprises all factors (other than x) that affect y.
These factors are treated as being unobservable.
u stands for "unobserved"
More details about mathematics
in Christof Luchsinger's part
Slide 41
Stochastic model Assumptions related to the error term
The error term u is (must be)
independent of the explanatory variable x
normally distributed with mean 0 and variance σ²: u ~ N(0, σ²)
E(y) = β0 + β1 · x
(Source: Wooldridge J. (2005). Introductory Econometrics: A Modern Approach, 3rd edition, South-Western College Pub. Subsequent images have the same source.)
Slide 42
Gauss-Markov Theorem, Independence and Normal Distribution
Under the 5 Gauss-Markov assumptions, the OLS estimator is the best linear unbiased estimator of the true parameters βi, given the present sample.
The OLS estimator is BLUE.
1. Linear in coefficients: y = β0 + β1 · x + u
2. Random sample of n observations {(x_i, y_i): i = 1, ..., n}
3. Zero conditional mean: the error u has an expected value of 0, given any values of the explanatory variable: E(u|x) = 0
4. Sample variation in the explanatory variable: the x_i are not constant and not all the same (not x = const, not x_1 = x_2 = ... = x_n)
5. Homoscedasticity: the error u has the same variance given any value of the explanatory variable: Var(u|x) = σ²
Independence and normal distribution of the error: u ~ Normal(0, σ²)
These assumptions need to be tested, among other things, by analyzing the residuals.
Based on: Wooldridge J. (2005). Introductory Econometrics: A Modern Approach. 3rd edition, South-Western.
Slide 43
Regression analysis with SPSS: Some detailed examples
Simple example (EXAMPLE02)
Dataset EXAMPLE02.SAV:
Sample of 99 men by body size and weight
Step 1: Formulation of the model
Regression equation of weight on size
weight = β0 + β1 · size + u
weight = dependent variable
size = independent variable
β0, β1 = coefficients
u = error term
Slide 44
Step 2: Estimation of the model
SPSS: Analyze => Regression => Linear...
Slide 45
Step 3: Verification of the model
SPSS Output (EXAMPLE02) F-test
The null hypothesis (H0) to verify is that there is no effect on weight.
The alternative hypothesis (HA) is that this is not the case.
H0: β0 = β1 = 0
HA: at least one of the coefficients is not zero
The empirical F-value and the corresponding p-value are computed by SPSS.
In the example, we can reject H0 in favor of HA (Sig. < 0.05). This means that the estimated
model is not only a theoretical construct but one that exists and is statistically significant.
Slide 46
SPSS Output (EXAMPLE02) t-test
The Coefficients table also provides a significance test for the independent variable.
The significance test evaluates the null hypothesis that the unstandardized regression coefficient
for the predictor is zero while all other predictors' coefficients are fixed at zero.
H0: βi = 0, βj = 0, j ≠ i
HA: βi ≠ 0, βj = 0, j ≠ i
The t statistic for the size variable (β1) is associated with a p-value of .000 ("Sig."), indicating that
the null hypothesis can be rejected. Thus, the coefficient is significantly different from zero.
This also holds for the constant (β0) with Sig. = .000.
Slide 47
Step 6. Interpretation of the model
SPSS Output (EXAMPLE02) Regression coefficients
weight_i = β0 + β1 · size_i
weight_i = −120.375 + 1.086 · size_i
Unstandardized coefficients show the absolute change of the dependent variable weight if the independent variable size changes by one unit.
Note: The constant −120.375 has no specific meaning. It's just the intersection with the y-axis.
Slide 48
Back to Step 3: Verification of the model
SPSS Output (EXAMPLE02) Coefficient of determination
[Diagram: regression line with a data point y_i, its estimate ŷ_i, and the sample mean ȳ; the total gap y_i − ȳ splits into a regression part ŷ_i − ȳ and an error part y_i − ŷ_i]
y_i = data point
ŷ_i = estimation (model)
ȳ = sample mean
The error is also called the residual.
Slide 49
SPSS Output (EXAMPLE02) Coefficient of determination I
Summing up the distances:
SS_Total = SS_Regression + SS_Error
Σ_i (y_i − ȳ)² = Σ_i (ŷ_i − ȳ)² + Σ_i (y_i − ŷ_i)²   (sums over i = 1, ..., n)
R Square = SS_Regression / SS_Total, with 0 ≤ R Square ≤ 1
R Square, the coefficient of determination, is .546.
In the example, about half the variation of weight is explained by the model (R² = 54.6%).
In bivariate regression, R² is equal to the squared value of the correlation coefficient of the two variables (r_xy = .739, r_xy² = .546).
The higher the R Square, the better the fit.
Slide 50
Step 5: Testing of assumptions
In the example, are the requirements of the Gauss-Markov theorem as well as the other assumptions met?
1. Is the model linear in coefficients? Yes, decision for the regression model.
2. Is it a random sample? Yes, clinical study.
3. Do the residuals have an expected value of 0
for all values of x? (zero conditional mean)
Scatterplot of residuals
4. Is there variation in the explanatory variable? Yes, clinical study.
5. Do the residuals have constant variance
for all values of x? (homoscedasticity)
Scatterplot of residuals
Are the residuals independent from one another?
Are the residuals normally distributed?
Scatterplot of residuals
(consider DurbinWatson)
Histogram
Slide 51
Scatterplot of standardized predicted values of y vs. standardized residuals
3. Zero conditional mean: The mean values of the residuals do not differ visibly from 0 across
the range of standardized estimated values. OK
5. Homoscedasticity: the residual plot is trumpet-shaped; the residuals do not have constant variance. This Gauss-Markov requirement is violated: there is heteroscedasticity.
Independence: there is no obvious pattern indicating that the residuals influence one another (for example a "wave-like" pattern). OK
Slide 52
Histogram of standardized residuals
Normal distribution of residuals:
Distribution of the standardized residuals is more or less normal. OK
Slide 53
Violation of the homoscedasticity assumption
How to diagnose heteroscedasticity
Informal methods:
Look at the scatterplot of standardized predicted y-values vs. standardized residuals.
Graph the data and look for patterns.
Formal methods (not pursued further in this course):
Breusch-Pagan test / Cook-Weisberg test
White test
Corrections
Transformation of the variable:
Possible correction in the case of EXAMPLE01 is a log transformation of variable weight
Use of robust standard errors (not implemented in SPSS)
Use of Generalized Least Squares (GLS):
The estimator is provided with information about the variance and covariance of the errors.
(The last two options are not pursued further in this course.)
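The idea behind the Breusch-Pagan test can be sketched without SPSS: regress the squared OLS residuals on the predictor and use LM = n · R² of that auxiliary regression, which follows a χ² distribution under homoscedasticity (a minimal sketch of the LM variant for one predictor; the simulated data are illustrative):

```python
import numpy as np
from scipy import stats

def breusch_pagan(x, y):
    """LM version of the Breusch-Pagan test for one predictor:
    regress squared OLS residuals on x; LM = n * R^2 ~ chi2(1) under H0."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    b1, b0 = np.polyfit(x, y, 1)
    resid2 = (y - (b0 + b1 * x)) ** 2
    g1, g0 = np.polyfit(x, resid2, 1)        # auxiliary regression
    fitted = g0 + g1 * x
    r2 = 1 - np.sum((resid2 - fitted) ** 2) / np.sum((resid2 - resid2.mean()) ** 2)
    lm = len(x) * r2
    return lm, stats.chi2.sf(lm, df=1)       # statistic and p-value

rng = np.random.default_rng(1)
x = np.linspace(1, 10, 200)
y = 2 + 3 * x + rng.normal(0, x)             # "trumpet": error spread grows with x
lm, p = breusch_pagan(x, y)                  # small p => heteroscedasticity
```

With the simulated trumpet-shaped errors the p-value is tiny, matching the visual diagnosis from the residual scatterplot.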
Slide 54
Multiple regression
Many similarities with simple Regression Analysis from above
Key steps in regression analysis
General purpose of regression
Mathematical model and stochastic model
Ordinary least squares (OLS) estimates and GaussMarkov theorem as well as independence
and normal distribution of error
All concepts are the same also regarding multiple regression analysis.
What is new?
Concept of multicollinearity
Concept of stepwise conduction of regression analysis
Dummy coding of categorical variables
Adjustment of the coefficient of determination ("Adjusted R Square")
Slide 55
Multicollinearity
Outline
Multicollinearity means there is a strong correlation between two or more independent variables.
Perfect collinearity means a variable is a linear combination of other variables.
It is then impossible to obtain unique estimates of the coefficients because there are an infinite number of combinations.
Perfect collinearity is rare in real-life data (unless you make a mistake).
Correlations, or even strong correlations, between variables are unavoidable.
SPSS detects perfect collinearity and eliminates one of the redundant variables.
Example: x1 and x2 are perfectly collinear => x1 is excluded automatically.
Slide 56
How to identify multicollinearity
If the correlation coefficient between a pair of variables is greater than 0.80, the variables should not be used at the same time.
An indicator of multicollinearity reported by SPSS is Tolerance.
Tolerance reflects the percentage of unexplained variance in a variable, given the other independent variables. Tolerance informs about the degree of independence of an independent variable.
Tolerance ranges from 0 (= multicollinear) to 1 (= independent).
Rule of thumb (O'Brien 2007): tolerance less than .10 => problem with multicollinearity
Output from the example on slide "Example of multicollinearity" (slide 64)
In addition, SPSS reports the Variance Inflation Factor (VIF), which is simply the inverse of the tolerance (1/tolerance). VIF ranges from 1 to infinity.
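Tolerance and VIF can be computed by hand to see what SPSS reports (a minimal NumPy sketch; the simulated predictors are illustrative, with x2 built to be nearly collinear with x1):

```python
import numpy as np

def tolerance(X, j):
    """Tolerance of predictor j: 1 - R^2 of regressing column j on the others."""
    X = np.asarray(X, float)
    others = np.delete(X, j, axis=1)
    A = np.column_stack([np.ones(len(X)), others])     # add intercept
    coef, *_ = np.linalg.lstsq(A, X[:, j], rcond=None)
    fitted = A @ coef
    ss_err = np.sum((X[:, j] - fitted) ** 2)
    ss_tot = np.sum((X[:, j] - X[:, j].mean()) ** 2)
    return ss_err / ss_tot                             # = 1 - R^2

rng = np.random.default_rng(2)
x1 = rng.normal(size=100)
x2 = x1 + rng.normal(scale=0.1, size=100)              # nearly collinear with x1
x3 = rng.normal(size=100)                              # independent predictor
X = np.column_stack([x1, x2, x3])

tol = [tolerance(X, j) for j in range(3)]
vif = [1 / t for t in tol]                             # VIF = 1 / tolerance
```

The two collinear columns get tolerances near 0 (large VIF), while the independent column stays near 1, matching the rule of thumb above.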
Slide 57
Symptoms of multicollinearity
When correlation is strong, the standard errors of the parameters become large.
It is difficult or impossible to assess the relative importance of the variables.
The probability is increased that a good predictor will be found non-significant and rejected.
There might be large changes in parameter estimates when variables are added or removed (stepwise regression).
There might be coefficients with a sign opposite of that expected.
Multicollinearity is
a severe problem when the research purpose includes causal modeling
less important when the research purpose is prediction, since the predicted values of the dependent variable remain stable relative to each other
Some hints to deal with multicollinearity
Ignore multicollinearity if prediction is the only goal
Center the variables to reduce correlation with other variables
(Centering data refers to subtracting the mean (or some other value) from all observations)
Conduct partial least squares regression
Compute principal components (by running a factor analysis) and use them as predictors
With large sample sizes, the standard errors of the coefficients will be smaller
Slide 58
Multiple regression analysis with SPSS: Some detailed examples
Example of multiple regression (EXAMPLE04)
Dataset EXAMPLE04.SAV:
Sample of 198 men and women based on body size, weight and age
Formulation of the model
Regression of weight on size and age
weight = β0 + β1 · size + β2 · age + u
weight = dependent variable
size = independent variable
age = independent variable
β0, β1, β2 = coefficients
u = error term
Slide 59
SPSS Output regression analysis (EXAMPLE04)
Overall F-test: OK (F = 487.569, p = .000) (table not shown here)
weight = β0 + β1 · size + β2 · age + u
weight = −85.933 + .812 · size + .356 · age
Unstandardized B coefficients show the absolute change of the dependent variable weight if the independent variable size changes by one unit.
The Beta coefficients are the standardized regression coefficients.
Their relative absolute magnitudes reflect their relative importance in predicting weight.
Beta coefficients are only comparable within a model, not between. Moreover, they are highly
influenced by misspecification of the model.
Adding or subtracting variables in the equation will affect the size of the beta coefficients.
Slide 60
SPSS Output regression analysis (EXAMPLE04) I
R Square is influenced by the number of independent variables.
=> R Square increases with increasing number of variables.
Adjusted R Square = R Square − m(1 − R Square) / (n − m − 1)
n = number of observations
m = number of independent variables
n − m − 1 = degrees of freedom (df)
Choose Adjusted R square for the reporting.
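The adjustment can be sketched in one line (a minimal Python sketch; the formula with the penalty term m(1 − R²)/(n − m − 1) is algebraically equivalent to the common form 1 − (1 − R²)(n − 1)/(n − m − 1), and the numbers below are illustrative):

```python
def adjusted_r_square(r_square, n, m):
    """Adjusted R^2 = R^2 - m * (1 - R^2) / (n - m - 1)
    (n = number of observations, m = number of independent variables)."""
    return r_square - m * (1 - r_square) / (n - m - 1)

# Illustrative: 198 observations, 2 predictors
adj = adjusted_r_square(0.896, 198, 2)   # slightly below the raw R^2
```

The adjusted value is always below the raw R², and the gap grows with the number of predictors m, which is the point of reporting it.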
Slide 61
Dummy coding of categorical variables
In regression analysis, a dummy variable (also called indicator or binary variable) is one that
takes the values 0 or 1 to indicate the absence or presence of some categorical effect that may
be expected to shift the outcome.
For example, seasonal effects may be captured by creating dummy variables for each of the
seasons. Also gender effects may be treated with dummy coding.
The number of dummy variables is always one less than the number of categories.
Categorical variable season => dummy variables season_1, season_2, season_3, season_4:
If season = 1 (spring): 1 0 0 0
If season = 2 (summer): 0 1 0 0
If season = 3 (fall):   0 0 1 0
If season = 4 (winter): 0 0 0 1
Categorical variable gender => dummy variables gender_1, gender_2:
If gender = 1 (male):   1 0
If gender = 2 (female): 0 1
SPSS syntax:
recode gender (1 = 1) (2 = 0) into gender_d.
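The same coding can be sketched in plain Python (a minimal sketch; following the "one less than the number of categories" rule above, the last category serves as the reference, so the helper function and names are illustrative, not SPSS output):

```python
def dummy_code(values, categories):
    """One 0/1 dummy column per category except the last (reference category)."""
    return {
        f"d_{c}": [1 if v == c else 0 for v in values]
        for c in categories[:-1]
    }

# gender: 1 = male, 2 = female; female is the reference category,
# mirroring the RECODE above (male -> 1, female -> 0)
gender = [1, 2, 2, 1]
dummies = dummy_code(gender, [1, 2])
```

For the four seasons the same call produces three dummy columns, with winter as the implicit reference category.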
Slide 62
Gender as dummy variable
Women and men have different
mean levels of size and weight.
=> introduce gender as independent dummy variable
=> recode gender (1 = 1) (2 = 0) into gender_d.
Mean values by gender:
men (gender = 1): size 181.19, weight 76.32
women (gender = 2): size 170.08, weight 63.95
Total: size 175.64, weight 70.14
Slide 63
SPSS Output regression analysis (EXAMPLE04)
Overall F-test: OK (F = 553.586, p = .000) (table not shown here)
weight = −25.295 + .417 · size + .476 · age + 8.345 · gender_d
"Switching" from women (gender_d = 0) to men (gender_d = 1) raises weight by 8.345 kg.
Model fits better (Adjusted R square .894 vs. .832) because of the "separation" of gender.
Slide 64
Example of multicollinearity
Human resources research in hospitals: Survey of nurse satisfaction and commitment
Dataset Subsample of n = 198 nurses
Regression model
salary = β0 + β1 · age + β2 · education + β3 · experience + β4 · experience² + u
Why a new variable experience²?
The experience effect on salary is disproportional for younger and older people.
The disproportionality can be described by a quadratic term.
"experience" and "experience²" are highly correlated!
Slide 65
SPSS Output regression analysis (Example of multicollinearity) I
Tolerance is very low for "experience" and "experience²".
One of the two variables might be eliminated from the model.
=> Use stepwise regression? Unfortunately, SPSS does not take multicollinearity into account.
Slide 66
SPSS Output regression analysis (Example of multicollinearity) II
Prefer this model, because a non-significant constant is difficult to handle.
Slide 67
Exercises 02: Regression
Resources => www.schwarzpartners.ch/ZNZ_2012 => Exercises Analysis => Exercise 02
Slide 68
Notes:
Slide 69
Analysis of Variance (ANOVA)
Example
Human resources research in hospitals: Survey of nurse salaries
Nurse salary [CHF/h] by level of experience (the "All" column is the grand mean):
Level of experience:  1    2    3    All
All                  36.  38.  42.  39.
Dataset (EXAMPLE05.sav)
Subsample of n = 96 nurses
Among other variables: work experience (3 levels) & salary (hourly wage CHF/h)
Typical questions
Does experience have an effect on the level of salary?
Are the results just by chance?
What is the relation between work experience and salary?
Slide 70
Boxplot
The boxplot indicates that salary may differ significantly depending on the level of experience.
[Boxplot of salary by experience level; the dashed line marks the grand mean]
Slide 71
Questions
Question in everyday language:
Does experience really have an effect on salary?
Research question:
What is the relation between work experience and salary?
What kind of model is suitable for the relation?
Is analysis of variance the right model?
Statistical question:
Forming hypotheses
H0: "No model" (= no significant factors)
HA: "Model" (= significant factors)
Can we reject H0?
Solution
Linear model with salary as the dependent variable (y_gk = wage of nurse k in group g)
y_gk = ȳ + α_g + ε_gk
ȳ = grand mean
α_g = effect of group g
ε_gk = random term
Slide 72
"Howto" in SPSS
Scales
Dependent Variable: metric
Independent Variables: categorical (called factors), metric (called covariates)
SPSS
Analyze => General Linear Model => Univariate...
Results
Total model significant ("Corrected Model": F(2, 93) = 46.193, p = .000). experien significant.
Example interpretation:
There is a main effect of experience (levels 1, 2, 3) on the salary, F(2, 93) = 46.193, p = .000.
The value of Adjusted R Squared = .488 shows that 48.8% of the variation in salary around the
grand mean can be predicted by the model (here: experien).
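Outside the SPSS GLM dialog, the one-way F-test itself can be sketched with SciPy (a minimal sketch; the three simulated groups below are made up to roughly match the group means 36, 38, 42 from the slide's table, not the EXAMPLE05.SAV data):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
# Illustrative salaries (CHF/h) for three experience levels, 32 nurses each:
g1 = rng.normal(36, 2, 32)
g2 = rng.normal(38, 2, 32)
g3 = rng.normal(42, 2, 32)

# One-way ANOVA F-test of equal group means
f_value, p_value = stats.f_oneway(g1, g2, g3)
```

With clearly separated group means the F-value is large and the p-value is far below .05, mirroring the significant main effect reported above.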
Slide 73
Key steps in analysis of variance
1. Design of experiments
ANOVA is typically used for analyzing the findings of experiments
Oneway ANOVA, Repeated measures ANOVA
Multifactorial ANOVA (two or more factor analysis of variance)
2. Calculating differences and sum of squares
Differences between group means, individual values and grand mean are squared and
summed up. This leads to the fundamental equation of ANOVA.
Test statistics for significance test is calculated from the means of the sums of squares.
3. Prerequisites
Data are independent
Normally distributed variables
Homogeneity of variance between groups
4. Verification of the model and the factors
Is the overall model significant (F-test)? Are the factors significant?
Are prerequisites met?
5. Checking measures
Adjusted R squared / partial Eta squared
Mixed ANOVA
Slide 74
Designs of ANOVA
Oneway ANOVA: one factor analysis of variance
1 dependent variable and 1 independent factor
Multifactorial ANOVA: two or more factor analysis of variance
1 dependent variable and 2 or more independent factors
MANOVA: multivariate analysis of variance
Extension of ANOVA used to include more than one dependent variable
Repeated measures ANOVA
1 independent variable but measured repeatedly under different conditions
ANCOVA: analysis of COVariance
The model includes a so-called covariate (metric variable)
MANCOVA: multivariate analysis of COVariances
Mixeddesign ANOVA possible (e.g. twoway ANOVA with repeated measures)
Slide 75
Sum of Squares
Step by step
Survey on hospital nurse salaries: salaries differ with the level of experience.
Guess: what if ȳ1 ≈ ȳ2 ≈ ȳ3?
[Chart: individual nurse salaries (CHF/h) by level of experience (1, 2, 3), with the mean salary of all nurses (ȳ), the group means (e.g. ȳ3 = mean of experience level 3), and y_3i, the salary of the i-th nurse with experience level 3; salary axis marks at 35.9, 38.6, 41.6, 42.7. Legend: A = part of the variation due to the experience level, B = random part of the variation, A + B = total variation from the mean of all nurses.]
Slide 76
Basic idea of ANOVA
The total sum of squared variation of salaries, SS_total, is separated into two parts (SS = sum of squares):
SS_between
Part of the sum of squared variation due to groups ("between groups", treatments)
(here: between levels of experience)
SS_within
Part of the sum of squared variation due to randomness ("within groups", also SS_error)
(here: within each experience group)
Fundamental equation of ANOVA:
SS_total = SS_between + SS_within
Σ_g Σ_k (y_gk − ȳ)² = Σ_g K_g · (ȳ_g − ȳ)² + Σ_g Σ_k (y_gk − ȳ_g)²   (sums over g = 1, ..., G and k = 1, ..., K_g)
g: index for groups from 1 to G (here: G = 3 levels of experience)
k: index for individuals within each group from 1 to K_g (here: K_1 = K_2 = K_3 = 32, K_total = K_1 + K_2 + K_3 = 96 nurses)
If ȳ1 ≈ ȳ2 ≈ ȳ3, then SS_between ≪ SS_within.
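The decomposition and the F statistic from the next slide can be verified on a toy dataset (a minimal NumPy sketch; the three tiny groups are made up for illustration, not the nurse data):

```python
import numpy as np

def sum_of_squares(groups):
    """Decompose total variation into between-group and within-group parts."""
    all_y = np.concatenate([np.asarray(g, float) for g in groups])
    grand_mean = all_y.mean()
    ss_total = np.sum((all_y - grand_mean) ** 2)
    ss_between = sum(len(g) * (np.mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum(np.sum((np.asarray(g) - np.mean(g)) ** 2) for g in groups)
    return ss_total, ss_between, ss_within

# Three illustrative groups (G = 3, K_g = 3 each, K_total = 9):
groups = [[35.0, 36.0, 37.0], [37.0, 38.0, 39.0], [41.0, 42.0, 43.0]]
ss_t, ss_b, ss_w = sum_of_squares(groups)
assert abs(ss_t - (ss_b + ss_w)) < 1e-9   # fundamental equation of ANOVA

# Mean squares and F statistic:
ms_b = ss_b / (3 - 1)    # SS_between / (G - 1)
ms_w = ss_w / (9 - 3)    # SS_within / (K_total - G)
f = ms_b / ms_w
```

Because the group means here are well separated, MS_between is much larger than MS_within and F is large; with nearly equal group means F would be close to 1.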
Slide 77
Significance testing
The test statistic F for significance testing is computed from the means of the sums of squares:
MS_total = SS_total / (K_total − 1)   (mean of the total sum of squared variation)
MS_between = SS_between / (G − 1)   (mean of the squared variation between groups)
MS_within = SS_within / (K_total − G)   (mean of the squared variation within groups)
Significance testing for the global model:
F = MS_between / MS_within
The F-test verifies the hypothesis that the group means are equal:
H0: ȳ1 = ȳ2 = ȳ3
HA: ȳi ≠ ȳj for at least one pair i ≠ j
If ȳ1 ≈ ȳ2 ≈ ȳ3, then MS_between ≈ MS_within.
F follows an F-distribution with (G − 1) and (K_total − G) degrees of freedom.
Slide 78
ANOVA with SPSS: A detailed example
Example of oneway ANOVA: Survey of nurse salaries (EXAMPLE05)
SPSS: Analyze => General Linear Model => Univariate...
Slide 79
SPSS Output ANOVA (EXAMPLE05) Tests of BetweenSubjects Effects I
Significant ANOVA model (called "Corrected Model")
Significant constant (called "Intercept")
Significant variable experien
Example interpretation:
There is a main effect of experience (levels 1, 2, 3) on the salary (F(2, 93) = 46.193, p = .000).
The value of Adjusted R Squared = .488 shows that 48.8% of the variation in salary around the
grand mean can be predicted by the variable experien.
Slide 80
SPSS Output ANOVA (EXAMPLE05) Tests of BetweenSubjects Effects II
Allocation of sum of squares to terms in the SPSS output
experien is part of SS_between.
In this case (one-way analysis) SS_experien = SS_between.
[SPSS output annotations: the "Intercept" row corresponds to the grand mean; the remaining rows correspond to SS_between, SS_total, and SS_within (= SS_error)]
Slide 81
Including Partial Eta Squared
Partial Eta Squared compares the amount of variance explained by a particular factor (all other
variables fixed) to the amount of variance that is not explained by any other factor in the model.
This means we only consider variance that is not explained by the other variables in the
model. Partial η² indicates what percentage of this variance is explained by a variable.

Partial η² = SS_Effect / (SS_Effect + SS_Error)

Example: Experience explains 49.8% of the previously unexplained variance.
Note: The values of partial η² do not sum to 100%! (hence "partial")
In the case of one-way ANOVA: partial η² is the proportion of the corrected total variance
that is explained by the model (= R²).
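The formula is a one-liner; the SS values here are hypothetical, chosen so the result reproduces the 49.8% quoted on the slide:

```python
# Partial eta squared = SS_effect / (SS_effect + SS_error).
# Hypothetical SS values (not the EXAMPLE05 output).
ss_effect, ss_error = 640.0, 644.0
partial_eta_sq = ss_effect / (ss_effect + ss_error)
print(round(100 * partial_eta_sq, 1))  # 49.8 (percent)
```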
Slide 82
Two-Way ANOVA
Human resources research: Survey of nurse salaries

Nurse Salary [CHF/h]      Level of Experience
Position        1      2      3      All
Office          35.    37.    39.    37.
Hospital        37.    40.    44.    40.
All             36.    38.    42.    39.

Now two factors are in the design:
Level of experience
Position
Typical questions:
Do position and experience have an effect on salary?
What "interaction" exists between position and experience?
Slide 83
Interaction
Interaction means there is a dependency between experience and position.
The independent variables have a complex influence on the dependent variable (salary).
This complex influence is called interaction.
The independent variables alone do not explain all of the variation of the dependent variable;
part of the variation is due to the interaction term.
An interaction means that the effect of a factor depends on the value of another factor.
[Diagram: experience (factor A) and position (factor B) each influence salary, as does their
interaction (factor A x B).]
Slide 84
Sum of Squares
Again SS_total = SS_between + SS_within,
with SS_between = SS_Experience + SS_Position + SS_Experience x Position.
It follows that
SS_total = (SS_Experience + SS_Position + SS_Experience x Position) + SS_within,
where SS_Experience x Position is the interaction of both factors simultaneously.

Total sum of variation SS_t
  Sum of variation between groups SS_b
    Sum of variation due to factor A: SS_A
    Sum of variation due to factor B: SS_B
    Sum of variation due to the interaction A x B: SS_AxB
  Sum of variation within groups SS_w
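For a balanced design, the finer decomposition of SS_between can be verified numerically. A sketch with invented salary values (two observations per cell, not the EXAMPLE06 data):

```python
import numpy as np

# data[a, b, k]: salary of the k-th person with experience level a and position b.
# Balanced design: 2 observations per cell. Numbers are invented for illustration.
data = np.array([
    [[35.0, 36.0], [37.0, 38.0]],   # experience 1: office, hospital
    [[37.0, 38.0], [40.0, 41.0]],   # experience 2
    [[39.0, 40.0], [44.0, 45.0]],   # experience 3
])
A, B, n = data.shape
grand = data.mean()
mean_a = data.mean(axis=(1, 2))          # marginal means of experience
mean_b = data.mean(axis=(0, 2))          # marginal means of position
mean_ab = data.mean(axis=2)              # cell means

ss_A = B * n * np.sum((mean_a - grand) ** 2)
ss_B = A * n * np.sum((mean_b - grand) ** 2)
ss_AB = n * np.sum((mean_ab - mean_a[:, None] - mean_b[None, :] + grand) ** 2)
ss_within = np.sum((data - mean_ab[:, :, None]) ** 2)
ss_total = np.sum((data - grand) ** 2)

# For a balanced design the decomposition is exact:
# SS_total = SS_A + SS_B + SS_AxB + SS_within
print(ss_total, ss_A + ss_B + ss_AB + ss_within)
```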
Slide 85
Example of two-way ANOVA: Survey of nurse salaries (EXAMPLE06)
SPSS: Analyze → General Linear Model → Univariate...
Slide 86
Interaction
The interaction term between fixed factors is included by default in ANOVA.
Example interpretation (among other duty descriptions):
There is also an interaction of experience and position on salary
(F(2, 90) = 34.606, p < .001).
The interaction term experien * position explains 29.7% of the variance.
Slide 87
Interaction I
Do different levels of experience influence the impact of different levels of position differently?
Yes: if experience has value 2 or 3, the influence of position is increased.
Simplified: the lines are not parallel.
Interpretation: Experience is more important in hospitals than in offices.
[Interaction plot: salary against experience, one line for office and one for hospital]
Slide 88
More on interaction
[Six schematic plots of salary against experien, one line per position level, each labeled with
which effects are present: main effect of experien, main effect of position, interaction.]
Slide 89
Requirements of ANOVA
0. Robustness
ANOVA is relatively robust against violations of its prerequisites.
1. Sampling
Random sample, no treatment effects (more in Lecture 10)
A well-designed study avoids violation of this assumption.
2. Distribution of residuals
Residuals (= errors) are normally distributed.
Correction: transformation
3. Homogeneity of variances
Residuals (= errors) have constant variance (more in Lecture 10).
Correction: weight the variances
4. Balanced design
Same sample size in all groups.
Correction: weight the means
SPSS automatically corrects unbalanced designs by Sum of Squares "Type III"
Syntax: /METHOD = SSTYPE(3)
Slide 90
Exercises 03: ANOVA
Resources => www.schwarzpartners.ch/ZNZ_2012 => Exercises Analysis => Exercise 03
Slide 91
Other multivariate Methods
Types of Multivariate Statistical Analysis
With regard to practical application, multivariate methods can be divided into two main groups:
methods for identifying structures and methods for discovering structures.

Dependence analysis (directed dependencies)
So called because the methods are used to test directed dependencies between variables.
The variables are divided into independent variables and dependent variable(s).
[Example diagram: independent variables (IV) price of product, quality of products, and quality
of customer service -> dependent variable (DV) customer satisfaction]

Interdependence analysis (non-directed dependencies)
So called because the methods are used to discover mutual dependencies between variables.
This is especially the case in exploratory data analysis (EDA).
[Example diagram: mutual dependencies between customer satisfaction, employee satisfaction,
and motivation of employees]
Slide 92
Choice of Method
Methods for identifying structures (dependence analysis):
  Regression Analysis
  Analysis of Variance (ANOVA)
  Discriminant Analysis
  Contingency Analysis
  (Conjoint Analysis)
Methods for discovering structures (interdependence analysis):
  Factor Analysis
  Cluster Analysis
  Multidimensional Scaling (MDS)

Choice of dependence-analysis method by scale of the variables:

                     IV metric               IV categorical
DV metric            Regression analysis     Analysis of Variance (ANOVA)
DV categorical       Discriminant analysis   Contingency analysis
Slide 93
Tree of methods (see also www.ats.ucla.edu/stat/mult_pkg/whatstat/default.htm, July 2012)
(See also www.methodenberatung.uzh.ch (in German))

Data Analysis
  Descriptive (univariate, bivariate)
  Inductive
    Univariate: t-test, χ² adjustment (goodness of fit)
    Bivariate: correlation, t-test, χ² independence
    Multivariate
      Dependence
        DV metric: IV not metric -> ANOVA; IV metric -> regression, (conjoint)
        DV not metric: IV not metric -> contingency; IV metric -> discriminant
      Interdependence
        Variables metric -> factor analysis, MDS
        Variables not metric -> cluster analysis

DV = dependent variable
IV = independent variable
Slide 94
Example of multivariate Methods (categorical / metric)
Linear discriminant analysis
Linear discriminant analysis (LDA) is used to find the linear combination of features which
best separates two or more groups in a sample.
The resulting combination may be used to classify groups in a sample.
(Example: Credit card debt, debt-to-income ratio, income => predict bankruptcy risk of clients)
LDA is closely related to ANOVA and logistic regression analysis, which also attempt to express
one dependent variable as a linear combination of other variables.
LDA is an alternative to logistic regression, which is frequently used in place of LDA.
Logistic regression is preferred when data are not normal in distribution or group sizes
are very unequal.
Slide 95
Example of linear discriminant analysis
Data from measures of body length of
two subspecies of puma (South & North America)
[Scatter plot: x2 [cm] (100-140) against x1 [cm] (150-250) for the two subspecies]
Species x1 x2
1 191 131
1 185 134
1 200 137
1 173 127
1 171 118
1 160 118
1 188 134
1 186 129
1 174 131
1 163 115
2 186 107
2 211 122
2 201 114
2 242 131
2 184 108
2 211 118
2 217 122
2 223 127
2 208 125
2 199 124
Species 1 = North America, 2 = South America
x1 body length: nose to top of tail
x2 body length: nose to root of tail
Other names for puma
cougar
mountain lion
catamount
panther
Slide 96
Very short introduction to linear discriminant analysis
Dependent Variable (also called discriminant variable): categorical
Puma example: species (two subspecies of puma)
Independent Variables: metric
Puma example: x1 & x2 (different measures of body length)
Goal
Discrimination between groups
Puma example: discrimination between the two subspecies
Estimate a function for discriminating between groups:

Y_i = β_0 + β_1 x_{i,1} + β_2 x_{i,2} + u_i

Y_i: discriminant variable
β_0, β_1, β_2: coefficients
x_{i,1}, x_{i,2}: measurements of body length
u_i: error term
Sketch of LDA
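The estimation sketched above can be reproduced with a small from-scratch Fisher discriminant fitted to the puma data from the slide. This is a minimal sketch (equal-covariance assumption, midpoint threshold), not the SPSS DISCRIMINANT procedure itself:

```python
import numpy as np

# Puma measurements from the slide (species 1 = North America, 2 = South America).
g1 = np.array([[191, 131], [185, 134], [200, 137], [173, 127], [171, 118],
               [160, 118], [188, 134], [186, 129], [174, 131], [163, 115]], float)
g2 = np.array([[186, 107], [211, 122], [201, 114], [242, 131], [184, 108],
               [211, 118], [217, 122], [223, 127], [208, 125], [199, 124]], float)

m1, m2 = g1.mean(axis=0), g2.mean(axis=0)
# Pooled within-group covariance matrix
S = (np.cov(g1.T) * (len(g1) - 1) + np.cov(g2.T) * (len(g2) - 1)) \
    / (len(g1) + len(g2) - 2)
w = np.linalg.solve(S, m2 - m1)      # discriminant direction (weights for x1, x2)
c = w @ (m1 + m2) / 2                # decision threshold at the midpoint of the means

scores = np.vstack([g1, g2]) @ w     # projection of every animal onto w
labels = np.array([1] * 10 + [2] * 10)
pred = np.where(scores < c, 1, 2)    # below threshold -> species 1, above -> species 2
print((pred == labels).mean())       # → 1.0 (100% correctly classified, as on the slide)
```

Note that the fitted weights have the same sign pattern and roughly the same ratio as the SPSS coefficients on the following slides: positive on x1, negative on x2.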
Slide 97
Data from measurements of body length of two subspecies of puma
[Two scatter plots: x2 [cm] (100-140) against x1 [cm] (150-250) for the two subspecies]
Slide 98
SPSS example of linear discriminant analysis (EXAMPLE07)
DISCRIMINANT
/GROUPS=species(1 2)
/VARIABLES=x1 x2
/ANALYSIS ALL
/PRIORS SIZE
/STATISTICS=MEAN STDDEV UNIVF BOXM COEFF RAW TABLE
/CLASSIFY=NONMISSING POOLED MEANSUB .
Slide 99
SPSS Output Discriminant analysis (EXAMPLE07) I
Both coefficients significant
Y_i = β_0 + β_1 x_{i,1} + β_2 x_{i,2} + u_i
Y_i = 4.588 + 0.131 x_{i,1} - 0.243 x_{i,2} + u_i
Slide 100
[Plot: discriminant variable Y (-5 to 5) for the 20 classified animals, species 1 on the left
and species 2 on the right, with the two "found" pumas A and B marked]

The two subspecies of pumas can be completely classified (100%).
See also the plot above, which is generated with
Y_i = 4.588 + 0.131 x_{i,1} - 0.243 x_{i,2} + u_i

"Found" two pumas A & B:
     x1    x2
A    175   120
B    200   110

What subspecies are they?
Use Y_i = 4.588 + 0.131 x_{i,1} - 0.243 x_{i,2} + u_i to determine their subspecies.
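The classification of the two found pumas can be computed directly. A sketch using the estimated coefficients; note that the signs of the constant and of the x2 coefficient did not survive extraction and are reconstructed here from the data (species-1 animals project to negative Y, species-2 animals to positive Y):

```python
# Discriminant function from the SPSS output (signs reconstructed):
# Y = 4.588 + 0.131*x1 - 0.243*x2; Y < 0 -> species 1, Y > 0 -> species 2.
def discriminant(x1, x2):
    return 4.588 + 0.131 * x1 - 0.243 * x2

y_a = discriminant(175, 120)   # puma A
y_b = discriminant(200, 110)   # puma B
print(round(y_a, 3), round(y_b, 3))  # -1.647 (species 1), 4.058 (species 2)
```

So puma A falls on the species-1 side of the threshold and puma B on the species-2 side, matching their positions in the plot.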
Slide 101
Another example
Hence the word "discrimination".
[Photo: Wason Wanchakorn / AP]
Slide 102
Notes: