
Advanced Statistical

Analysis Using SPSS®


31832-001

V14.0 Revised 02/01/06 sg/mr/ls/ss


For more information about SPSS® software products, please visit our Web site at
http://www.spss.com or contact
SPSS Inc.
233 South Wacker Drive, 11th Floor
Chicago, IL 60606-6412
Tel: (312) 651-3000
Fax: (312) 651-3668

SPSS is a registered trademark and its other product names are the trademarks of SPSS Inc. for its
proprietary computer software. No material describing such software may be produced or distributed
without the written permission of the owners of the trademark and license rights in the software and the
copyrights in the published materials.
The SOFTWARE and documentation are provided with RESTRICTED RIGHTS. Use, duplication, or
disclosure by the Government is subject to restrictions as set forth in subdivision (c)(1)(ii) of The Rights in
Technical Data and Computer Software clause at 52.227-7013. Contractor/manufacturer is SPSS Inc., 233
South Wacker Drive, 11th Floor, Chicago, IL 60606-6412.
TableLook is a trademark of SPSS Inc.
Windows is a registered trademark of Microsoft Corporation.
DataDirect, DataDirect Connect, INTERSOLV, and SequeLink are registered trademarks of MERANT
Solutions Inc.
Portions of this product were created using LEADTOOLS © 1991-2000, LEAD Technologies, Inc. ALL
RIGHTS RESERVED.
LEAD, LEADTOOLS, and LEADVIEW are registered trademarks of LEAD Technologies, Inc.
Portions of this product were based on the work of the FreeType Team (http://www.freetype.org).

General notice: Other product names mentioned herein are used for identification purposes only and may
be trademarks or registered trademarks of their respective companies in the United States and other
countries.

Advanced Statistical Analysis Using SPSS


Copyright © 2006 by SPSS Inc.
All rights reserved.
Printed in the United States of America.

No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or
by any means, electronic, mechanical, photocopying, recording, or otherwise, without the prior written
permission of the publisher.

Table of Contents

CHAPTER 1 INTRODUCTION AND OVERVIEW


INTRODUCTION .........................................................................................................................1-1
COURSE GOALS .........................................................................................................................1-1
TAXONOMY OF METHODS.........................................................................................................1-2
GENERAL APPROACH ................................................................................................................1-4

CHAPTER 2 DISCRIMINANT ANALYSIS


HOW DOES DISCRIMINANT ANALYSIS WORK? ........................................................................2-2
THE ELEMENTS OF A DISCRIMINANT ANALYSIS ......................................................................2-3
THE DISCRIMINANT MODEL .....................................................................................................2-4
HOW CASES ARE CLASSIFIED ...................................................................................................2-4
ASSUMPTIONS OF DISCRIMINANT ANALYSIS ...........................................................................2-5
ANALYSIS TIPS..........................................................................................................................2-6
A TWO-GROUP DISCRIMINANT EXAMPLE ................................................................................2-7
CHECKING VARIANCE ASSUMPTION ........................................................................................2-7
RUNNING A DISCRIMINANT ANALYSIS .....................................................................................2-9
DISCRIMINANT COEFFICIENTS ................................................................................................2-14
CLASSIFICATION STATISTICS ..................................................................................................2-15
PREDICTION ............................................................................................................................2-17
ASSUMPTION OF EQUAL COVARIANCE ...................................................................................2-18
MODIFYING THE LIST OF PREDICTORS ...................................................................................2-19
CASEWISE STATISTICS AND OUTLIERS ...................................................................................2-20
ADJUSTING PRIOR PROBABILITIES..........................................................................................2-23
VALIDATING THE DISCRIMINANT MODEL ..............................................................................2-24
STEPWISE MODEL SELECTION ................................................................................................2-26
THREE–GROUP DISCRIMINANT ANALYSIS .............................................................................2-27

CHAPTER 3 BINARY LOGISTIC REGRESSION


HOW LOGISTIC REGRESSION WORKS .......................................................................................3-2
THE LOGISTIC EQUATION .........................................................................................................3-3
ELEMENTS OF LOGISTIC REGRESSION ANALYSIS .....................................................................3-3
ASSUMPTIONS OF LOGISTIC REGRESSION.................................................................................3-4
LOGISTIC REGRESSION EXAMPLE: LOW BIRTH WEIGHT..........................................................3-5
ACCURACY OF PREDICTION ....................................................................................................3-11
INTERPRETING LOGISTIC REGRESSION COEFFICIENTS ...........................................................3-12
MAKING PREDICTIONS ............................................................................................................3-14
ESTIMATED PROBABILITIES ....................................................................................................3-14


CHECKING CLASSIFICATIONS .................................................................................................3-15


RESIDUAL ANALYSIS ..............................................................................................................3-17
STEPWISE LOGISTIC REGRESSION ...........................................................................................3-21
ROC CURVES ..........................................................................................................................3-25
APPENDIX: COMPARISON TO DISCRIMINANT ANALYSIS ........................................................3-29

CHAPTER 4 MULTINOMIAL LOGISTIC REGRESSION


MULTINOMIAL LOGISTIC MODEL .............................................................................................4-2
A MULTINOMIAL LOGISTIC ANALYSIS: PREDICTING CREDIT RISK .........................................4-4
INTERPRETING COEFFICIENTS .................................................................................................4-12
CLASSIFICATION TABLE .........................................................................................................4-14
MAKING PREDICTIONS ............................................................................................................4-15
APPENDIX: MULTINOMIAL LOGISTIC WITH A TWO-CATEGORY OUTCOME...........................4-17

CHAPTER 5 SURVIVAL ANALYSIS


WHAT IS SURVIVAL ANALYSIS? ...............................................................................................5-2
CONCEPTS .................................................................................................................................5-2
CENSORING ...............................................................................................................................5-3
WHAT TO LOOK FOR IN SURVIVAL ANALYSIS ..........................................................................5-4
SURVIVAL PROCEDURES IN SPSS.............................................................................................5-5
AN EXAMPLE: KAPLAN-MEIER ................................................................................................5-5
COX REGRESSION ...................................................................................................................5-14
AN EXAMPLE: COX REGRESSION............................................................................................5-14
CHECKING THE PROPORTIONAL HAZARDS ASSUMPTION .......................................................5-22
APPENDIX: BRIEF EXAMPLE OF COX REGRESSION WITH A TIME-VARYING COVARIATE ......5-24

CHAPTER 6 CLUSTER ANALYSIS


HOW DOES CLUSTER ANALYSIS WORK?..................................................................................6-2
DATA TYPES IN CLUSTERING....................................................................................................6-2
WHAT TO LOOK FOR WHEN CLUSTERING ................................................................................6-2
METHODS ..................................................................................................................................6-5
HIERARCHICAL METHODS ........................................................................................................6-5
NON-HIERARCHICAL METHOD: K-MEANS CLUSTERING .........................................................6-6
NON-HIERARCHICAL METHOD: TWOSTEP CLUSTERING ..........................................................6-7
DISTANCE AND STANDARDIZATION .........................................................................................6-9
OVERALL RECOMMENDATIONS ..............................................................................................6-10
EXAMPLE I: HIERARCHICAL CLUSTERING OF PRODUCT DATA .............................................6-11
CLUSTER RESULTS ..................................................................................................................6-15
EXAMPLE II: K-MEANS CLUSTERING OF USAGE DATA..........................................................6-23
K-MEANS RESULTS ................................................................................................................6-26
EXAMPLE III: TWOSTEP CLUSTERING OF TELECOM DATA ....................................................6-29


CHAPTER 7 FACTOR ANALYSIS


USES OF FACTOR ANALYSIS .....................................................................................................7-1
WHAT TO LOOK FOR WHEN RUNNING FACTOR ANALYSIS ......................................................7-2
PRINCIPLES ...............................................................................................................................7-2
THE IDEA OF A PRINCIPAL COMPONENT ...................................................................................7-3
FACTOR ANALYSIS VERSUS PRINCIPAL COMPONENTS .............................................................7-3
NUMBER OF FACTORS ...............................................................................................................7-4
ROTATION .................................................................................................................................7-5
FACTOR SCORES AND SAMPLE SIZE .........................................................................................7-6
METHODS ..................................................................................................................................7-6
OVERALL RECOMMENDATIONS ................................................................................................7-7
AN EXAMPLE: 1988 OLYMPIC DECATHLON PERFORMANCES ..................................................7-7
PRINCIPAL COMPONENTS WITH ORTHOGONAL ROTATION ......................................................7-9
PRINCIPAL AXIS FACTORING WITH AN OBLIQUE ROTATION ..................................................7-17
ADDITIONAL CONSIDERATIONS ..............................................................................................7-20

CHAPTER 8 LOGLINEAR MODELS


WHAT ARE LOGLINEAR MODELS? ...........................................................................................8-1
RELATIONS AMONG LOGLINEAR, LOGIT MODELS, AND LOGISTIC REGRESSION .....................8-3
WHAT TO LOOK FOR IN LOGLINEAR AND LOGIT ANALYSIS.....................................................8-4
ASSUMPTIONS ...........................................................................................................................8-4
PROCEDURES IN SPSS THAT RUN LOGLINEAR OR LOGIT ANALYSIS .......................................8-5
MODEL SELECTION EXAMPLE: LOCATION PREFERENCE..........................................................8-6
APPENDIX: LOGIT ANALYSIS WITH SPECIFIC MODEL (GENLOG) ...........................................8-17

CHAPTER 9 MULTIVARIATE ANALYSIS OF VARIANCE


WHY PERFORM MANOVA?.....................................................................................................9-1
MANOVA ASSUMPTIONS ........................................................................................................9-2
WHAT TO LOOK FOR IN MANOVA ..........................................................................................9-3
EXAMPLE: MEMORY INFLUENCES ............................................................................................9-4
POST HOC TESTS .....................................................................................................................9-13

CHAPTER 10 REPEATED MEASURES ANALYSIS OF VARIANCE


WHY DO A REPEATED MEASURES STUDY?............................................................................10-1
ASSUMPTIONS .........................................................................................................................10-4
EXAMPLE: ONE-FACTOR DRUG STUDY ..................................................................................10-5
EXAMINING RESULTS ...........................................................................................................10-10
FURTHER ANALYSES.............................................................................................................10-15
PLANNED COMPARISON ........................................................................................................10-17
REPEATED MEASURES WITH MISSING DATA........................................................................10-18
APPENDIX: AD VIEWING WITH PRE-POST BRAND RATINGS ................................................10-22


REFERENCES ......................................................................................... R-1

EXERCISES.............................................................................................. E-1


Chapter 1
Introduction and Overview

Topics
• Course Goals
• Taxonomy of Methods
• General Approach

Introduction
The material in this course consists of a survey of many of the advanced statistical procedures
available in the SPSS Base, Regression Models, and Advanced Models modules. In this chapter
we describe the goals of the course and the level at which the presentation will be made. Also, we
will present a framework for the various analyses in order to provide guidance in choosing the
appropriate method to answer a question of interest. Finally, we outline what we will discuss in
detail and the types of (exploratory) analyses that we assume have already been done.

Course Goals
The goal of this course is to prepare you to choose the best statistical method and effectively
apply it to answer a question. First we hope to provide a framework for these advanced statistical
procedures so you can choose, or advise others to choose, the most appropriate statistical method
to address a research question. Typically this is a function of the question asked and the nature of
the data. Second, for each statistical method we want you to understand the assumptions underlying
the analysis and to know just how much they matter when drawing conclusions.
Recommendations might be based on theoretical arguments, Monte Carlo simulation results, and
experience. Next, you should be able to run these analyses using SPSS for Windows and interpret
the results, knowing which portions of the output deserve the most attention. Finally, for analyses
that are not ends in themselves, you should know the next steps to take.

As you probably know, many of the individual topics covered here could be profitably discussed
over several days. In fact, full semester or two semester classes for some of these topics (loglinear
models, survival analysis, and multivariate analysis of variance) are offered at universities. In the
short amount of time available, we cannot hope to approach this type of coverage. Rather we will
take a practical and pragmatic approach to the statistical methods discussed here. We will not
mathematically derive the key equations from first principles, but will review them as they are
applied in practice. There is great value in understanding the mathematical theory behind these techniques,
but we can’t accomplish this in so little time, and there are effective alternative means of doing
this (university courses, short courses given at association meetings, textbooks). Many of the
books in the reference section cover the individual topics in depth.


Taxonomy of Methods
The statistical method chosen for an analysis is usually a function of the question being asked and
the nature of the measurements made. Below we place statistical methods involving predictive
relationships within a framework of scale (measurement) properties of the independent
(predictor) and dependent (outcome) measures. The table presents three types of measurement
scales (interval, ordinal, and nominal (categorical)). Recall that for nominal scales the numeric
score simply represents a category. In ordinal scales, the data values reflect the relative position
(ordering or rank) of the object measured. For interval scales, a one-unit change has the same
interpretation (measures the same change) anywhere on the scale. Elaborate refinements of this
table exist; for a much more extensive classification system, see Andrews et al. (1981).

Table 1.1 Statistical Methods Examining Relationships


                                        Independent Variable(s)
                     Nominal                   Ordinal                   Interval

Dependent  Nominal   Crosstabs (Chi-Square)    Monotonic Regression      Discriminant*
Variable             Loglinear & Logit*                                  Logistic Regression*
                     CHAID (decision tree)

           Ordinal   Nonparametric Tests       Nonparametric             Ordinal Regression
                     Ordinal Regression        Correlation               Monotonic Regression
                     Monotonic Regression

           Interval  T-Test                                              Regression
                     ANOVA                                               Nonlinear Regression
                     Kaplan-Meier*                                       Cox Regression*

* Covered in Advanced Statistics Using SPSS

We will discuss several of these techniques in this class and they are all available within SPSS, its
add-on options, and third party products. Each cell has several methods listed and the distinctions
will be fleshed out in the course. Here we will only make a few overview comments about the
major categories covered in this course.

Independent Categorical - Dependent Categorical


Crosstabulation tables with associated chi-square tests examine relations between two categorical
variables. Loglinear models extend this to analyze relations among two or more categorical
variables. If one of the categorical variables is considered a dependent measure to be predicted
from the others, then logit models (related to loglinear models) are applied. Since the number of
cells in a multi-way table rapidly increases as the number of variables increases, there are practical
limits, unless the sample is very large, to the number of variables that can be used in a logit
model. CHAID (Chi-square automatic interaction detection) analysis provides a heuristic method
for investigating which of a large number of predictive categorical variables most strongly relate
to the categorical dependent measure.

Independent Interval - Dependent Categorical


Analyses involving interval scale predictors and categorical outcome measures can be viewed as
classification problems: how can we best predict into which group an individual will fall? Logistic
regression and discriminant analysis are used for such applications as credit scoring and risk
assessment in health studies.


Independent Categorical - Dependent Interval


Analysis of variance investigates group mean differences for a single dependent variable.
Multivariate analysis of variance tests for group mean differences on several related dependent
measures simultaneously. Survival analysis, using Kaplan-Meier methods, is used to compare
different groups (treatment groups, demographic groups) on survival time when the data may be
censored: individuals might drop out or survive beyond the end of the study, so the final survival
time is not known.

Independent Interval - Dependent Interval


Most of you are already familiar with linear regression in which relationships between an interval
scale dependent measure and interval scale predictors are studied. Linear regression assumes that
the dependent measure is a linear function of the parameters (regression coefficients). Nonlinear
regression extends this model to permit nonlinear relations between the dependent measure and
the parameters. Survival analysis, using Cox regression, investigates survival time as a function
of interval scale predictors allowing, as does Kaplan-Meier, for censored data.
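To make the "linear in the parameters" distinction concrete, here is an illustrative pair of models (supplied here as an example, not taken from the course data): the first is linear in its parameters even though it is curved in X, while the second is nonlinear in the parameter β1.

\[
Y = \beta_0 + \beta_1 X + \beta_2 X^2 + \varepsilon \qquad \text{(linear in the parameters)}
\]
\[
Y = \beta_0 \left(1 - e^{-\beta_1 X}\right) + \varepsilon \qquad \text{(nonlinear in the parameters)}
\]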

Dependent Ordinal
Ordinal regression, in which a function of the cumulative probability of falling in an ordered
category is modeled, can be applied to an ordinal outcome measure when the predictors are
nominal and/or interval. Typically these models are applied when the dependent measure has a
limited number of values with a clear ordering (for example, severity of illness). The SPSS
Ordinal Regression procedure, within the Advanced Models option, can perform such analyses.
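As a sketch of the general form of such a model (the cumulative logit, or proportional-odds, version shown here is one common choice of link function):

\[
\operatorname{logit}\bigl[P(Y \le j)\bigr] = \theta_j - (\beta_1 x_1 + \cdots + \beta_p x_p), \qquad j = 1, \ldots, J-1
\]

where the θj are threshold parameters for the J ordered categories and the β coefficients are shared across categories.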

Exploring Relations and Natural Groupings


Other statistical methods examine relations among variables or natural groupings of observations.
Such methods are considered exploratory since you are usually exploring the data as opposed to
testing specific hypotheses. Factor and principal components analyses investigate common
components among interval scale variables by fitting their correlation pattern. Both are used as
data reduction methods since, when successful, they facilitate creation of new factor or
component variables that represent the shared aspects of multiple variables. Cluster analysis
investigates whether there are any natural groupings of observations based on proximity of scores
on the cluster variables. In marketing it is often used for market segmentation. For these methods
there is no declaration of dependent or independent variables. Each variable has the same status
as common relationships or groupings are studied.

Table 1.2 Exploring Relations and Natural Groupings


Groupings/Relations of Variables        Groupings of Cases
(Interval Scale)                        (Scale depends on clustering method)

Factor Analysis                         Cluster Analysis
Principal Components Analysis

Hybrids
Your data situation may involve a combination of interval and categorical predictor variables.
Some of the methods assuming interval scale predictors (regression, Cox regression, nonlinear
regression and logistic) permit inclusion of categorical predictors if those with more than two

categories are first converted into a “dummy code” format. The simplest dummy coding creates,
for a k-category variable, k-1 dummy variables, each recording whether (coded 1) or not (coded
0) an observation falls within one particular category; the omitted category serves as the
reference. Similarly, interval scales can sometimes be
collapsed into meaningful categories; this is sometimes done when all but a few variables are
categorical. Also, some methods (for example, the TwoStep cluster method, which clusters based
on interval and nominal variables) formally incorporate mixed types of measures.
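As a minimal sketch of dummy coding in SPSS syntax (the variable names below are hypothetical and are not part of the course data files):

    * Create k-1 = 2 dummy variables from a 3-category variable named region.
    * Category 3 serves as the reference (omitted) category.
    COMPUTE region1 = (region = 1).
    COMPUTE region2 = (region = 2).
    EXECUTE.

Each COMPUTE assigns 1 when the logical expression is true and 0 otherwise, which produces exactly the 0/1 coding described above.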

General Approach
For each statistical method discussed, we try to motivate the analysis with sample applications,
discuss the assumptions and check them when feasible, perform the analysis, and interpret the
results. In this class we will not generally perform exploratory data analysis before applying
formal methods. This is only because of time constraints given the number of methods we cover.
We strongly recommend that you check and explore your data before performing these inferential
methods. We follow this recommendation in our own work and detail the methods in other
classes (for example, Statistical Analysis Using SPSS, Advanced Techniques: Regression,
Advanced Techniques: ANOVA, and Data Mining: Modeling).

Most of the statistical procedures have many options. The most important and commonly used
variations will be discussed in detail, but not every option will be considered. To investigate these
features use either the on-line help system within SPSS or turn to the written discussion contained
in the SPSS manuals. Finally, recall that some infrequently used options are only available using
SPSS syntax; the SPSS manuals document such features and would be the best help.


Chapter 2
Discriminant Analysis

Topics
• How Does Discriminant Analysis Work?
• The Elements of a Discriminant Analysis
• The Discriminant Model
• How Cases are Classified
• Assumptions of Discriminant Analysis
• Analysis Tips
• A Two–Group Discriminant Example
• Extending the Analysis
• Validating the Discriminant Model
• Stepwise Model Selection
• Three–Group Discriminant Analysis

Introduction
Discriminant analysis is a technique designed to characterize the relationship between a set of
variables, often called the response or predictor variables, and a grouping variable with a
relatively small number of categories. To do so, discriminant creates a linear combination of the
predictors that best characterizes the differences among the groups. The technique is related to
both regression and multivariate analysis of variance, and as such it is another general linear
model technique. Another way to think of discriminant analysis is as a method to study
differences between two or more groups of cases on several variables simultaneously.

Generally speaking, there are two types of discriminant analysis. In predictive discriminant
analysis (PDA), the focus is “formulating a rule by which prediction of, or identification with,
group membership for a given unit is determined” (Huberty, 1994). Descriptive discriminant
analysis (DDA) uses the grouping variable, instead, as the explanatory variable and attempts to
study relationships between it and the response variables. In this, it has much in common with
multivariate analysis of variance (MANOVA). In this chapter we will discuss PDA, which is the
most common use of discriminant in applied settings.

One of the chief advantages of discriminant analysis is that the grouping variable can be
measured on any scale. In many applications, the grouping variable is measured on a nominal
scale. Although, by necessity, a discriminant function must be calculated from existing data, a
majority of discriminant applications involve predicting group membership in new data when
membership is currently unknown. Common uses of discriminant include:

1. Deciding whether a bank should offer a loan to a new customer.


2. Determining which customers are likely to buy a company’s products.


3. Classifying prospective students into groups based on their likelihood of success at a
school.
4. Identifying patients who may be at high risk for problems after surgery.

The discriminant analysis grouping variable can have two or more categories. When the variable
is nominal or ordinal and has three or more categories, discriminant is often used instead of
binary logistic regression (see Chapter 3), which handles only two-category outcomes; multinomial
logistic regression (Chapter 4) is another alternative.

How Does Discriminant Analysis Work?


Discriminant analysis assumes that the cases of interest come from two or more separate and distinct
populations, as represented by the grouping variable. Furthermore, we assume that each
population is measured on a set of variables—the predictors—that follow a multivariate normal
distribution. Discriminant attempts to find the linear combinations of the predictors that best
separate the populations. If we assume two predictor variables, X and Y, and two groups for
simplicity, this situation can be represented as in Figure 2.1.

Figure 2.1 Two Normal Populations and Two Predictor Variables, with Discriminant Axis

The two populations or groups clearly differ in their mean values on both the X and Y-axes.
However, the linear function—in this instance, a straight line—that best separates the two groups
is a combination of the X and Y values, as represented by the line running from lower left to
upper right in the scatterplot. This line is a graphic depiction of the discriminant function, or
linear combination of X and Y, that is the best predictor of group membership. In this case with
two groups and one function, discriminant will find the midpoint between the two groups that is
the optimum cutoff for separating the two groups (represented here by the short line segment).
The discriminant function and cutoff can then be used to classify new observations.

If there are more than two predictors, then the groups will (hopefully) be well separated in a
multidimensional space, but the principle is exactly the same. If there are more than two groups,
more than one classification function can be calculated, although not all the functions may be
needed to classify the cases. Since the number of predictors is almost always more than two,
scatterplots such as Figure 2.1 are not always that helpful. Instead, plots are often created using
the new discriminant functions, since it is on these that the groups should be well separated.

The effect of each predictor on each discriminant function can be determined, and the predictors
can be identified that are more important or more central to each function. Nevertheless, unlike in
regression, the exact effects of the predictors are not typically seen as of ultimate importance in
PDA. Given the primary goal of correct prediction, the specifics of how this is accomplished are
not as critical as the prediction itself (such as offering loans to customers who will pay them
back). Second, as will be demonstrated below, the predictors do not directly predict the grouping
variable, but instead a value on the discriminant function, which, in turn, is used to generate a
group classification.

The Elements of a Discriminant Analysis


The general procedure for a predictive discriminant analysis can be outlined as follows.

1. A grouping variable must be defined whose categories are exhaustive and mutually
exclusive.

2. A set of potential predictors must be selected. This is one of the most important steps,
although in many real-world applications, the set of predictors will be limited by what is
available in existing datasets.

3. Once the above two steps are accomplished, as with any multivariate technique, the next
job is to study your data to see if it meets the assumptions of doing a discriminant
analysis (to be discussed below). It is also important to look for outliers and unusual
patterns in the data, and to look for variables that might not be good predictors.
Univariate ANOVAs and correlations can be used to identify such variables.

4. The goal of a PDA is to correctly classify cases into the appropriate group. Given this, as
with any multivariate technique, parsimony is an important subgoal. This means using the
fewest predictors needed for accurate classification, although not necessarily the smallest
set of classification functions. Fewer predictors will mean lower costs of data collection
and easier interpretation.

5. You then specify the discriminant analysis and run it with SPSS. A method of model
selection must be chosen, and prior probabilities for group membership should be
considered. A significance test is available to see whether differences in group means on
each function are due to chance or not. The relative importance (in terms of explained
variance) of each function is also calculated.

6. You next typically turn to the classification results to see how well cases have been
placed in their known groups.

7. At least two statistics are available to examine the effect of individual predictors on the
discriminant functions and, in particular, to decide whether a particular variable adds
little to the classification ability of the model.


8. You might next look for outliers in the data and examine cases that have been
misclassified to check for problems and to see if and how the model can be respecified.

9. Finally, it is of the utmost importance that the model be validated by some procedure.
SPSS supplies two methods to do this in the Discriminant procedure.

The Discriminant Model


The discriminant model has the following mathematical form for each function:

FK = D0 + D1X1 + D2X2 + ... + DpXp

where FK is the score on function K, the Di’s are the discriminant coefficients, and the Xi’s are the
predictor or response variables (there are p predictors). The maximum number of functions K that
can be derived is equal to the minimum of the number of predictors (p) or the quantity (number of
groups – 1). In most applications, there will be more predictors than categories of the grouping
variable, so the latter will limit the number of functions. For example, if we are trying to predict
which customers will choose one of three offers, at most (3 – 1) = 2 functions can be derived.
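Stated compactly, with p predictors and g groups, the same rule is:

\[
F_k = D_{k0} + D_{k1}X_1 + \cdots + D_{kp}X_p, \qquad k = 1, \ldots, \min(p,\; g-1)
\]

so a three-category grouping variable yields at most two functions, however many predictors are available.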

When more than one function is derived, each subsequent function is chosen to be uncorrelated,
or orthogonal, to the previous functions (just as in principal components analysis, where each
component is uncorrelated with all others). This allows for straightforward partitioning of
variance.

Discriminant creates a linear combination of the predictor variables to calculate a discriminant


score for each function. This score is used, in turn, to classify cases into one of the categories of
the grouping variable.

How Cases Are Classified


There are three general types of methods to classify cases into groups.

1. Maximum likelihood or probability methods: These techniques assign a case to group k if


its probability of membership is greater for group k than for any other group. These
probabilities are posterior probabilities, as defined below. This method relies upon
assumptions of multivariate normality to calculate probability values.

2. Linear classification functions: These techniques assign a case to group k if its score on
the function for that group is greater than its score on the function for any other group.
This method was first suggested by Fisher, so these functions are often called Fisher
classification functions (which is how SPSS refers to them).

3. Distance functions: These techniques assign a case to group k if its distance to that
group’s centroid is smaller than its distance to any other group’s centroid. Typically, the
Mahalanobis distance is the measure of distance used in classification.


When the assumption of equal covariance matrices is met, all three methods give equivalent
results.
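For reference, a sketch of the squared Mahalanobis distance used in the third method, where S is the pooled within-groups covariance matrix and x̄k is the centroid of group k:

\[
D_k^2(\mathbf{x}) = (\mathbf{x} - \bar{\mathbf{x}}_k)'\, \mathbf{S}^{-1} (\mathbf{x} - \bar{\mathbf{x}}_k)
\]

A case is assigned to the group with the smallest D²; when the equal covariance assumption holds, this assignment agrees with the probability and Fisher-function rules.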

SPSS uses the first technique, a probability method based on Bayesian statistics, to derive a rule
to classify cases. The rule uses two probability estimates. The prior probability is an estimate of
the probability that a case belongs to a particular group when no information from the predictors
is available. Prior probabilities are typically either determined by the number of cases in each
category of the grouping variable, or by assuming that the prior probabilities are all equal (so that
if there are three groups, the prior probability of each group would be 1/3). We have more to say
about prior probabilities below.

Second, the conditional probability is the probability of obtaining a specific discriminant score
(or one further from the group mean) given that a case belongs to a specific group. By assuming
that the discriminant scores are normally distributed, it is possible to calculate this probability.

With this information and by applying Bayes’ rule, the posterior probability is calculated, which
is defined as the likelihood or probability of group membership, given a specific discriminant
score. It is this probability value that is used to classify a case into a group. That is, a case is
assigned to the group with the highest posterior probability.
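In symbols, with g groups, Bayes' rule combines the prior probability P(Gk) and the conditional probability P(d | Gk) of a discriminant score d to give the posterior probability:

\[
P(G_k \mid d) = \frac{P(d \mid G_k)\, P(G_k)}{\sum_{j=1}^{g} P(d \mid G_j)\, P(G_j)}
\]

and the case is assigned to the group with the largest P(Gk | d).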

Although SPSS uses a probability method of classification, you will most probably use a method
based on a linear function to classify new data. This is mainly for ease of calculation because
calculating probabilities for new data is computationally intensive compared to using a
classification function. This will be illustrated below.

Assumptions of Discriminant Analysis


As with other general linear model techniques, discriminant makes some fairly rigorous
assumptions about the population. And as with these other techniques, it tends to be fairly robust
to violations of these assumptions.

Discriminant assumes that the predictor variables are measured on an interval or ratio scale.
However, as with regression, discriminant is often used successfully with variables that are
ordinal, such as questionnaire responses on a five- or seven-point Likert scale. Nominal variables
can be used as predictors if they are given dummy coding. The grouping variable can be
measured on any scale and can have any number of categories, though in practice most analyses
are run with five or fewer categories.

Discriminant assumes that each group is drawn from a multivariate normal population. This
assumption is often violated in practice; as sample size increases, moderate
departures from normality are usually not a problem. If this assumption is violated, the tests of
significance and the probabilities of group membership will be inexact. If the groups are widely
separated in the space of the predictor variables, this will not be as critical as when there is a fair
amount of overlap between the groups.

When the number of categorical predictor variables is large (as opposed to interval–ratio
predictors), multivariate normality cannot hold by definition. In that case, greater caution must be
used, and many analysts would choose to use logistic regression instead. Most evidence indicates
that discriminant often performs reasonably well with such predictors, though.


Another important assumption is that the covariance matrices of the various groups are equal.
This is equivalent to the standard assumption in analysis of variance about equal variances across
factor levels. When this is violated, distortions can occur in the discriminant functions and the
classification equations. For example, the discriminant functions may not provide maximum
separation between groups when the covariances are unequal. If the covariance matrices are
unequal but the variables’ distribution is multivariate normal, the optimum classification rule is
the quadratic discriminant function. But if the matrices are not too dissimilar, the linear
discriminant function performs quite well, especially when the sample sizes are small. This
assumption can be tested with the Explore procedure or with the Box’s M statistic, displayed by
Discriminant.

For a more detailed discussion of problems with assumption violation, see Lachenbruch (1975) or
Huberty (1994).

Analysis Tips
In addition to the assumptions of discriminant, some additional guidelines are helpful. Many
analysts would recommend having at least 10 to 20 times as many cases as predictor variables to
insure that a model doesn’t capitalize on chance variation in a particular sample. For accurate
classification, another common rule is that the number of cases in the smallest group should be at
least five times the number of predictors. In the interests of parsimony, Huberty recommends
having a goal of only 8 to 10 response variables in the final model. Although in applied work this
may be too stringent, keep in mind that more is not always better.

Outlying cases can affect the results by biasing the values of the discriminant function
coefficients. Looking at the Mahalanobis distance for a case or examining the probabilities is
normally an effective check for outliers. If a case has a relatively high probability of being in
more than one group, it is difficult to classify. Analyses can be run with and without outliers to
see how results are affected.

Multicollinearity is less of a problem in PDA because the exact effect of a predictor variable is
typically not the focus of an analysis. When two variables are highly correlated, it is difficult to
partition the variance between them, and the individual coefficient estimates can be unstable. Still, the
accuracy of prediction may be little affected. Multicollinearity can be more of a problem when
stepwise methods of variable selection are used, since variables can be removed from a model for
reasons unrelated to that variable’s ability to separate the groups.

Note on Course Data Files


All files for this class are located in the c:\Train\Advstat folder on your training machine. If you
are not working in an SPSS Training center, the training files can be copied from the floppy disk
or CD that accompanies this guide. If you are running SPSS Server (click File…Switch Server to
check), then you should copy these files to the server or to a machine that can be accessed
(mapped) from the computer running SPSS Server.

A Note about Variable Names and Labels in Dialog Boxes


SPSS can display either variable names or variable labels in dialog boxes. In this course we
display the variable names in alphabetical order. In order to match the dialog boxes shown here,
from within SPSS:


Click Edit…Options

Within the General tab sheet of the Options dialog box:

Click the Display names option button


Click the Alphabetical option button
Click OK, click OK to confirm

A Two-Group Discriminant Example


To demonstrate predictive discriminant analysis we will use data from a customer satisfaction
study similar to many applications of this technique. Data collected on customers who purchased
a VCR included demographic information, ratings of various product and company attributes, and
the likelihood of buying another VCR from the company. The relevant variables are:

buyyes Willingness to buy another VCR


age Age of respondent
complain Performance: complaint resolution
educ Education of respondent
fail Did product ever fail to operate?
pinnovat Performance: innovative company
preliabl Performance: reliability
puse Performance: ease of Use
qual Performance: overall quality
use Frequency of use
value Performance: good value for money

The goal of this analysis is to determine which set of demographic and attitude items best predicts
which customers might buy another VCR. The grouping variable is called buyyes and is coded
into two categories, those not likely to buy again, and those likely to do so. None of the variables
in the analysis are measured on a true interval scale, yet the use of such variables in discriminant
is quite common and will allow us to explore potential problems that arise.

Click File…Open…Data
Move to the c:\Train\Advstat directory (if necessary)
Double click on CSM.sav

Rather than spend time studying the data closely, we perform one preliminary analysis.

Checking Variance Assumption


One simple test is to see whether the variances for each group are similar for each predictor
variable. Although this is not equivalent to testing for equality of covariance matrices, it is a good
first step. The Explore procedure provides the necessary output.

Click Analyze…Descriptive Statistics…Explore


Move age, educ, and qual into the Dependent List box
Move buyyes (the grouping variable) into the Factor List box
Click the Plots button
Click the Power estimation option button


Figure 2.2 Plots Dialog Box in Explore

The power estimation choice provides a direct test of homogeneity of variance for each predictor
(and estimates the power transformation that best stabilizes the variance). The boxplots will
provide a graphical view of the distributions. Notice that normality can also be tested (the
Normality plots with tests check box), but there is no point in doing so here since the variables
can’t be normally distributed.

Click Continue, click OK
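For reference, these dialog choices correspond roughly to the following command syntax (a sketch of what the Paste button would produce; the pasted subcommands may differ slightly):

    * Explore age, educ, and qual within each category of buyyes.
    * SPREADLEVEL requests the spread-versus-level plot with the Levene test.
    EXAMINE VARIABLES=age educ qual BY buyyes
      /PLOT=BOXPLOT SPREADLEVEL
      /STATISTICS=DESCRIPTIVES
      /NOTOTAL.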

Using the Outline Pane, locate the Test of Homogeneity pivot table, or scroll to it in the Contents
Pane. The Levene statistic is a direct test for whether or not the variance of each predictor
variable is the same for each category of buyyes (willingness to buy another VCR). A low
probability means that we can reject the null hypothesis of equality. In this instance, we conclude
that variances are unequal for age and overall quality rating, but similar for education.

Figure 2.3 Levene Test for Homogeneity of Variance


A graphical representation is afforded by the boxplots for each predictor. Click on the Boxplot
choice under Age of respondent in the Outline Pane to see the one for age, displayed in Figure 2.4
(age is an ordinal variable). The widths of the two boxes—equivalent to the interquartile range—
are dissimilar, so the variance of those likely to buy a VCR again is greater than for those not
likely to buy again. By itself, this type of information is only a very preliminary look at the data.
We will make a more detailed test of the assumption of equal covariance matrices when running
the discriminant analysis. But we will keep in mind this potential problem. In an actual analysis,
make sure you examine all the predictor variables.

Figure 2.4 Boxplot for Age Grouped by Buyyes

Running a Discriminant Analysis


Click Analyze...Classify...Discriminant

In the main dialog box, you must specify at least a grouping variable and one independent or
predictor variable. Although SPSS will run an analysis with this minimum set of specifications,
you will always request additional output.

Move buyyes into the Grouping Variable box


Click Define Range button, and type 1 for the Minimum and 2 for the Maximum
Click Continue
Move age, complain, educ, fail, pinnovat, preliabl, puse, qual, use, and value into the
Independents List box


Figure 2.5 Discriminant Analysis Main Dialog Box

As with regression, the model can be created using all independent variables together or, instead,
by using a stepwise method of selection. In this example we use the default choice of all
predictors together.

Click the Statistics button


Click the Box’s M check box and the Fisher’s check box

Figure 2.6 Statistics Subdialog Box

By default, none of the options in the Statistics box are selected. The Box’s M statistic is a direct
test of the equality of covariance matrices. The four Matrices choices allow you to examine
various correlations between the predictors, or various forms of the covariance between the
predictors. The two sets of function coefficients are in addition to the standardized coefficients
printed by default. The Fisher’s classification function coefficients, as described previously, can
be used to classify new cases. The unstandardized coefficients are the raw coefficients from the
discriminant model. Since they predict a discriminant score and are in the original units of the
variables, they are typically not that useful, and they are not interpreted. They can be used to
make predictions, but they are not as convenient to use as the Fisher coefficients when there are
more than two groups.


SPSS will always, by default, provide significance tests to help you determine whether the set of
predictors is good at discriminating between the groups.

The specification of the predictor variables and the coefficients derived from the model are the
first step in requesting an analysis. Next, we must ask for summary statistics to see how well the
model classified cases.

Click Continue
Click the Classify button
Click the Summary table check box
Click the Separate-groups check box in the Plots area

Figure 2.7 Classification Subdialog Box

By default, SPSS provides no direct information about how well the model performed, except for
the overall significance tests noted above. Most important is the classification table that directly
shows how well the cases were classified (selected via the Summary table check box).

Plots are also useful, and they are created in the space of the discriminant functions, not the
original variables. In this analysis, there can be only one function since the number of groups is
two, and so a combined groups plot and territorial map are not produced.

By default, the prior probabilities are set to all groups equal. We will change this in the next
analysis.

Click Continue
Click the Save button
Click Discriminant scores check box


Figure 2.8 Save New Variables Subdialog Box

For additional analyses SPSS allows you to save three types of information. The predicted group
membership is calculated as described above. The discriminant scores are the scores for each case
on each discriminant function. These are the scores in the original units, not based on the Fisher
functions. These scores can be used to create additional plots. Third, the probabilities of group
membership are the posterior probabilities of membership in each group, so there will be k
probabilities for k groups. Model information needed to calculate these statistics can be written in
an XML (Extensible Markup Language) format (for web-based deployment of the model).

Click Continue, click OK
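For reference, the specifications made above correspond approximately to the following command syntax (a sketch of what Paste would generate; minor default subcommands may differ):

    * Two-group discriminant analysis of buyyes with all ten predictors entered together.
    DISCRIMINANT
      /GROUPS=buyyes(1 2)
      /VARIABLES=age complain educ fail pinnovat preliabl puse qual use value
      /ANALYSIS ALL
      /METHOD=DIRECT
      /PRIORS=EQUAL
      /STATISTICS=BOXM COEFF TABLE
      /PLOT=SEPARATE
      /SAVE=SCORES
      /CLASSIFY=NONMISSING POOLED.

Here BOXM and COEFF request Box’s M and the Fisher classification coefficients, TABLE the summary classification table, SEPARATE the separate-groups plots, and SAVE=SCORES the saved discriminant scores.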

Before we turn to more interesting output, it is important to see how many cases have been
excluded from the analysis. The first table produced in the Contents Pane provides this
information.

There are 605 cases in the file, but SPSS used only 438, or 72.4%, in the analysis. Cases are
missing on the predictors (109), the grouping variable buyyes (31), or both (27). As with any type
of analysis, when a substantial fraction of cases have missing data and are removed, you should
be more cautious in interpreting the results, and you should see whether the missing cases are
different from the non-missing cases (the SPSS Missing Values Analysis module provides means
to do such checking). In this instance, we will proceed with the analysis and examine the rest of
the output.

Figure 2.9 Summary of Cases Used in Analysis

Scroll down to Eigenvalues and Wilks’ Lambda tables

The Eigenvalues table provides information about the relative efficacy of each discriminant
function. With only one function, the percent of variance is always 100% (don’t assume this
means that all the variance in the grouping variable is actually explained). The useful statistic in
this table is the canonical correlation, which measures the association between the discriminant
scores and buyyes. Here it is equivalent to the standard Pearson correlation coefficient (when
there are more than two groups it is equivalent to eta from a one–way analysis of variance). This
correlation is reasonably high, indicating that the model is likely to have some predictive ability.
The square of the canonical correlation, .441, is the proportion of total variance in the
discriminant scores explained by differences among the groups, i.e., the two categories of buyyes.

Figure 2.10 Overall results for the Discriminant Model

In the second table, Wilks’ lambda is .559 (1 – .441), or the proportion of variance not explained.
Lambda is used to test the null hypothesis that the means of the two groups on the discriminant
function are equal. The null hypothesis is one of no difference, so with a p value less than .005,
we can reject that hypothesis and conclude that the means differ. If this test were not significant at
a chosen alpha level, it would mean that the predictors had no discriminatory ability. Of course,
in large data sets, as with any statistical test, rejecting the null hypothesis doesn’t mean that the
effect size is necessarily large. In predictive discriminant analysis, the fundamental criterion of a
good model is predictive accuracy, and lambda is not a test of that question.
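One way to see how these quantities fit together in the single-function case (λ1 is the eigenvalue reported in the Eigenvalues table and Rc the canonical correlation):

\[
\Lambda = \frac{1}{1 + \lambda_1} = 1 - R_c^2 = 1 - .441 = .559
\]

With more than one function, the overall lambda is the product of the 1/(1 + λi) terms across functions.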

Scroll up now to the Box’s M test for equality of covariance matrices. As we saw above with the
boxplots, there is evidence that the variances are not equal for at least age. The null hypothesis of
equality is well rejected, as the probability is zero for all intents and purposes.

Figure 2.11 Box’s M Test of Equality of Covariance Matrices

However, the Box’s M test is quite powerful and leads to rejection of equal covariances when the
ratio N/p is large, where N is the number of cases and p is the number of variables. Here the ratio
is over 40 to 1, but that is still not that large. The test is also sensitive to lack of multivariate
normality, which applies to these data. Nevertheless, the very low probability value indicates that
the covariance matrices are probably not equal. The effect on the analysis is to create errors in the
assignment of cases to groups. We will attempt to correct for this problem in a later analysis by
requesting the use of separate group covariance matrices.


Analysis Tip
With large datasets, use an alpha level of .01 or smaller in the Box’s M test.

Discriminant Coefficients
We did not request the raw discriminant coefficients because they are not that useful in applied
work, especially when there is more than one function. Instead, we examine first the standardized
canonical discriminant function coefficients. As for any type of standardized coefficient, they
allow us to directly compare variables measured on different scales. In combination with the
structure coefficients in the next table, they can be used to assess the relative impact of the
predictors. The standardized coefficients can take on values greater in absolute value than one.

Figure 2.12 Standardized Canonical Coefficients

Only one of the predictor variables has a reasonably large coefficient, pinnovat (rating of the
company as innovative), with a coefficient of .593. Interpreting the effect of pinnovat can be
tricky because the direction of effects is arbitrary. To be sure of the meaning of a positive or
negative score, you need to check the location of the group centroids. Scroll down to the
Functions at Group Centroids table.

Figure 2.13 Group Centroids

This table presents the values of the unstandardized discriminant function at the group means, in
other words, at the mean for all ten variables for each group. The discriminant function was
created to maximize the separation of the two categories of buyyes. The value of the function for
category 2, those customers likely to buy, is positive (.613). This means that higher scores on a
variable with a positive coefficient will be associated with group 2 membership.


Turning back to the effect of pinnovat (Figure 2.12), this implies that a higher rating of company
innovation makes a customer more willing to buy another VCR. The coefficients for all the
predictors are positive, if modest, except for education (-.146). This means that increasing
education is associated with a decreased likelihood of buying again.

Although only the coefficient for pinnovat is large, deciding whether any variables should be
dropped can’t be done until additional output is studied.

Scroll down to the Structure Matrix table. The structure coefficients have several interpretations,
but as noted in the footnote to the table, they are calculated as the pooled within-groups
correlations between the predictors and the standardized functions. They tell how closely a
variable and a function are related, rather than expressing, as do the standardized coefficients
themselves, the effect of a variable on a function score. One advantage of the structure
coefficients is that they are not affected by multicollinearity; a second is that they tend to be more
stable in small samples.

Figure 2.14 Structure Matrix

The largest coefficient is again for pinnovat, and the next largest for frequency of use, as before.
However, the third largest coefficient (.604) is for complaint resolution, which had a very small
standardized coefficient. Only the age of the respondent has relatively small coefficients in both
tables. Although interpretation of the coefficients is informative, that is not the focus of PDA.

Classification Statistics
Next, Discriminant reports on how well the discriminant function did in classifying cases into the
categories of buyyes. The first two tables (not shown here) tell us that 469 cases can be classified
(a case can be classified so long as it is not missing data on the predictors), and that the prior
probabilities were set to .50 for each group. Click in the Content Pane on the last part of the
output, the Classification Results table.


Figure 2.15 Classification Table

This table presents a crosstabulation of actual group membership and predicted group
membership, plus a classification of the ungrouped cases. The cases circled on the diagonal are
those that have been correctly classified. The total percentage correctly classified is 81.7%.
Whether or not this is acceptable cannot be decided in the abstract but must be related to the goals
of each analysis. In clinical applications, 81.7% might be too low, but in marketing research, that
percentage may be viewed as completely acceptable. Although the Wilks’ lambda for the
discriminant function is significant, the true worth of the discriminant analysis is in the
percentage of correct predictions.

How well did we do compared to chance? If we simply guess the larger group 100% of the time
(which is all that can be done if no information is available on the predictors), we would be
correct 296/438 = 67.5% of the time. Thus the 81.7% figure, while hardly perfect accuracy,
certainly does far better than guessing.

Note that the prediction was better for those customers likely to buy another VCR from the
company, 86.5% compared to 71.8%. Another way to view accuracy is to calculate column
percentages. For those predicted to buy again, the percentage of correct predictions is 256/296 =
86.5%; for those predicted as not likely to buy again, the percentage correct is 102/142 = 71.8%.
In this instance the percentages are identical, but they usually are not.

Click the histogram for those likely to buy another VCR

These histograms display the unstandardized discriminant scores for each group. In an ideal
analysis, the scores would be completely separated on either side of zero, each distribution
following a normal curve. In this histogram, mispredictions essentially occur for those cases with
negative scores (recall that group 2 of buyyes had a positive centroid score). It is worth examining
these histograms to get a sense of the separation of the groups. The distribution for those not
likely to buy again (not shown) has a higher standard deviation, a graphical indication of the
greater difficulty of predicting membership in this category.


Figure 2.16 Discriminant Score Histogram for Those Likely to Buy Again

Prediction
Next, scroll up in the Content Pane to the Classification Function Coefficients. These are the
Fisher coefficients that are used for prediction of new cases. There is one function for each group.

Figure 2.17 Fisher Linear Discriminant Functions Used for Prediction

To predict whether a new customer is likely to buy a second VCR, the value of each variable for
each case is multiplied by its coefficient, these products are summed together, and the constant is
added in for the first function. This is repeated for the second function. The customer is assigned
to the group with the largest function score. You can use the SPSS transformation language to
create the Fisher scores and determine which one is largest to make a prediction in a new dataset.
You can also use the functions to ask “What if?” questions. For example, if a customer’s rating
on pinnovat goes from 5 to 7, holding the other variables constant, will her predicted group
change from 1 to 2?
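
As a concrete illustration, here is a minimal sketch in the SPSS transformation language. The constants, coefficients, and the variable name fuse (standing in for the frequency-of-use predictor) are hypothetical placeholders; in practice you would copy the constants and coefficients for all ten predictors from the Classification Function Coefficients table in Figure 2.17.

* Hypothetical sketch of scoring with the Fisher classification functions.
* Replace the placeholder constants and coefficients with the values from
* Figure 2.17, and include every predictor in each COMPUTE statement.
COMPUTE fisher1 = -9.25 + 1.10*pinnovat + 0.85*fuse.
COMPUTE fisher2 = -11.40 + 1.45*pinnovat + 0.95*fuse.
* Assign each case to the group with the larger classification score.
COMPUTE predgrp = 1.
IF (fisher2 > fisher1) predgrp = 2.
EXECUTE.

Scoring in this way lets you classify new customers, or explore “what if” scenarios, without rerunning Discriminant.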

Extending the Analysis


There are several directions by which the first analysis can be elaborated, improved, or tested. We
will continue to use the variable buyyes as a grouping variable so that changes to the analysis can
be readily compared. At the end of this chapter, we will run an analysis with a grouping variable
that has more than two categories.

Assumption of Equal Covariance


The homogeneity of variance assumption was not met, so classification was not necessarily
optimal. This may not be a problem in this analysis if we are satisfied with the percentage of
correct predictions. And in general, if the groups are well separated in the discriminant space,
heterogeneity of variance will not be terribly important. The one straightforward step that can be
taken is to ask SPSS to do classification using separate covariance matrices. This option can be
selected in the Classify subdialog box. Using separate covariance matrices does not affect results
prior to classification. However, since SPSS does not use the original scores to do classification,
you will not be able to reproduce these results on new data easily. The use of the Fisher
classification functions is not equivalent to classification by SPSS with separate covariance
matrices. What can be done is to request this analysis and compare it to classification based on
the pooled within–groups covariance and see if the results are similar or not.

Click the Dialog Recall tool , click Discriminant Analysis


Click the Classify button, click Separate-groups in the Use Covariance area
Click Continue, then on the Save button
Click the Discriminant scores check box to turn off this option
Click Continue, click OK

After the procedure is complete, click in the Outline Pane on the last item of the output, the
classification table.

Figure 2.18 Classification Table

The accuracy of prediction has decreased from 81.7% to 81.1%. Accuracy is the same for those
who are likely to buy again and decreases somewhat for those not likely to buy. At this point, the
judgment of the analyst becomes paramount. The change in accuracy is slight—which is typically
what occurs when using this alternative classification method—so you might decide that, despite
the violation of assumptions, the original analysis is acceptable. Again, the dilemma is that the
Fisher coefficients haven’t changed at all (verify this if you wish by comparing the two sets), so
classification with them cannot reproduce the results in Figure 2.18. Thus, for most applications,
users continue to pool the covariance matrices.

Modifying the List of Predictors


In our first analysis, we noticed that the variable age was not too important in either the structure
table or as a standardized coefficient. It has been shown in several studies that, unlike in
regression, predictive accuracy can decrease as the number of predictors increases (what
continues to decrease as predictors are added is Wilks’ lambda, the measure of unexplained
variance). We’ll drop age and see what difference it makes.

Click the Dialog Recall tool , click Discriminant Analysis


Click the Classify button and click the Within-groups option button
Click Continue
Remove age from the variable list
Click OK

The Eigenvalues table (not shown) lists the canonical correlation as .661, little changed from its
previous value of .664.

Scroll to the Standardized Canonical Coefficients pivot table in the Viewer window

With age removed from the analysis, the other coefficients have not changed much in value.
Notice the odd effect of overall quality (qual), which has a negative coefficient (but very small in
magnitude). This is undoubtedly due to its high association with several of the other predictors.
The structure coefficients (not shown) have also changed only slightly. This is to be expected.

Figure 2.19 Discriminant Function Coefficients

Scroll to the Classification pivot table

The overall percentage of correct predictions has increased to 84%, as has predictive accuracy for
those not likely to buy and those likely to buy again. This model with age removed is better, by
the standard of accuracy, than the model with age included as a predictor. This result—removing
variables and increasing accuracy— is quite common in discriminant applications. Could more
variables be removed and accuracy increased even further? It is certainly possible (and is left as
an exercise for the student). In general, creating an acceptable discriminant model is not done in
one step but with a series of refinements to the initially specified model.

Figure 2.20 Classification Table with Age Removed from Model

Casewise Statistics and Outliers


(For Those with Extra Time)
SPSS produces a detailed table providing statistics on each case, including its classification, the
probabilities used by SPSS in classifying the case, the squared Mahalanobis distance to each
group's centroid, and the discriminant score. We request this output for the first 30 cases because
the table is so large. These statistics are helpful in determining whether a case is clearly a member
of its correct group, almost as likely to be a member of another group, or extremely likely to be a
member of another group, none of which can be easily ascertained from the Fisher scores.

Click the Dialog Recall tool , click Discriminant Analysis


Click the Classify button
Click the Casewise results check box
Click the Limit cases to first check box, and then type 30 in the text box

Figure 2.21 Completed Classification Subdialog Box

Click Continue, click OK


Scroll to the Casewise Statistics table, activate the pivot table


Scroll down to case number 15

The table is so large that it might be easier to view if you open it as an SPSS pivot table object by
right–clicking on the table. The table is displayed in two pieces in Figures 2.22 and 2.23. Only
cases with no missing values on the predictor variables are displayed, and each is identified by its
row number in the Data Editor.

Figure 2.22 Left Half of Casewise Statistics Table

The information displayed in Figure 2.22 is:

1. Actual Group: the category of buyyes to which the case belongs.


2. Predicted Group: the predicted category calculated by SPSS. Mispredictions are marked
with double asterisks.
3. P(D>d|G=g): The conditional probability, or the probability of the observed score given
membership in the most likely group.
4. P(G=g|D=d): The posterior probability used by SPSS to classify the case. It is the
probability of belonging to this group given the discriminant score.
5. Squared Mahalanobis Distance to predicted group’s centroid.

Figure 2.23 presents the same information for the second highest group, i.e. the group with the
second highest posterior probability. When there are only two groups, the two posterior
probabilities sum to 1.0.

Looking at case 15, we see that he or she is a customer who is likely to buy again and whom
SPSS correctly classified in that category. The conditional probability is quite high, .944, that
someone in group 2 could have this particular discriminant score (.539, see Figure 2.23—notice
how close it is to the group centroid, see Figure 2.13) or one further from the group's centroid. Its
squared distance to the group 2 centroid is consequently very small, .005. The probability that this
person is in group 2 is .836; the probability that this person is in group 1 is .164 (those two values
sum to 1), so SPSS assigned it to group 2.
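
To see where the posterior probabilities come from, the sketch below applies the standard classification formula under equal priors and a pooled covariance matrix: each group's weight is proportional to exp(-d²/2), where d² is the squared Mahalanobis distance to that group's centroid. The value .005 is taken from the casewise table for case 15; the distance to the group 1 centroid (about 3.26) is an assumed value chosen so that the result matches the reported posterior of roughly .84.

* Sketch: posterior probability of group 2 from the two squared distances.
* Assumes equal priors and a pooled covariance matrix; d1sq is an assumed value.
COMPUTE d1sq = 3.26.
COMPUTE d2sq = 0.005.
COMPUTE post2 = EXP(-d2sq/2) / (EXP(-d1sq/2) + EXP(-d2sq/2)).
COMPUTE post1 = 1 - post2.
* With unequal priors, each EXP term would be multiplied by its group's prior.
EXECUTE.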


The first step in looking for outliers is to locate cases with high squared Mahalanobis distance. In
this output, those are cases 25 and 28. For large samples from a multivariate normal
distribution, the squared Mahalanobis distance is approximately distributed as a chi–square with
degrees of freedom equal to the number of variables in the function (here 9). That means that
neither case 25 nor 28 is very distant, in a probabilistic sense, from its centroid (but buyyes is
missing for case 25, so this case won’t be used in the classification table). On the other hand,
because of the many categorical predictors, the assumption that the squared distance is distributed
as chi–square is dubious. Therefore, we recommend looking for outliers that are relatively large,
compared to the other values, especially in nonnormal distributions.
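
If you do want the chi-square reference probability, it can be computed directly with the transformation language; the sketch below uses the squared distance of 2.405 reported for case 28 and 9 degrees of freedom.

* Upper-tail probability of a squared Mahalanobis distance of 2.405
* against a chi-square distribution with 9 degrees of freedom.
COMPUTE outprob = SIG.CHISQ(2.405, 9).
EXECUTE.
* The result is roughly .98, confirming that this distance is not unusual
* by the chi-square criterion.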

Case 28 has also been misclassified by SPSS. This customer was someone who would buy again,
but has been classified as the opposite. His or her probability of membership in group 1 is very
high, .991, so this is a case that has been badly misclassified. The squared distances from case 28
to the two group centroids are 2.405 (Figure 2.22) and 11.748 (Figure 2.23), respectively.
The discriminant score is large and negative, placing it relatively far from both centroids (this is
why the conditional probability is only .121 for group 1 to which it has been assigned).

Scroll to the right so the last column can be viewed

Figure 2.23 Right Half of Casewise Statistics Table

You will also want to look at cases that were barely correctly classified, such as 18 or 19. Such
cases may be used to further refine the analysis by seeing what such cases may have in common.
Once outliers and interesting cases have been located, there are several additional analyses that
can be conducted, but typically the next step is to look at the raw data for the outliers to search for
patterns and commonalties (as with linear regression).


Adjusting Prior Probabilities


The question of which set of prior probabilities to specify is one for which no absolute advice can
be given. The prior probabilities provide Discriminant with information about the distribution of
buyyes, and they are used by SPSS to classify cases and to calculate the Fisher coefficients.
Changing them will definitely affect the classification results (but not the calculation of the other
coefficients or the Wilks’ lambda statistics).

Most authorities advise that priors should be set to the population sizes of each group, if this
information is known. Unfortunately, in many real–world applications, population sizes are
unknown. For this and other reasons, the default choice in SPSS is to set the priors equal, i.e.,
equal prior probabilities that a case belongs to any one group. The other option is to set the priors
equal to the relative sample sizes of each group. This is not recommended unless you are
confident that the relative proportions of the sample group sizes are similar to those found in the
population groups. For the customer dataset in these examples, this assumption might hold if the
response rate was high and tests for nonresponse bias detect few problems.

The effect of changing the prior probability on classification will be slight when the groups are
well–separated. Simulation studies have shown that the use of unequal priors tends to increase
error for the smaller groups and decrease it for the larger groups.

Analysis Tip:
Any set of prior probabilities can be specified using syntax.
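
A minimal sketch of such a specification is shown below. The predictor list is abbreviated and the two prior values are arbitrary examples (they must sum to 1); substitute the full variable list used in the analysis.

* Sketch: supply custom prior probabilities on the PRIORS subcommand.
DISCRIMINANT
  /GROUPS=buyyes(1 2)
  /VARIABLES=pinnovat age
  /PRIORS=.35 .65
  /CLASSIFY=NONMISSING POOLED.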

We’ll change the priors to sample sizes and see what the effect is on classification.

Click the Dialog Recall tool , click Discriminant Analysis


Click the Classify button, and deselect Casewise results
Click the Compute from group sizes option button
Click Continue, then OK

Figure 2.24 Classification Results with Priors Set to Sample Sizes

Click on the Classification Results table in the Outline Pane. The overall percentage of correctly
classified cases has decreased to 81.1%, close to the value when age was included as a predictor.
As suggested by the simulation studies, the accuracy in the larger group, category 2, has increased
from 88.5% to 91.9% (compare to Figure 2.20), and the accuracy has decreased for those not
likely to buy again, to 58.5% from 74.6%. It would seem that we can adjust our results by playing
around with the prior probabilities with little cost, but of course there is no free lunch. Setting the
priors to sample group sizes means that this model has been optimized for these exact sample
proportions. In a new dataset, if the sample proportions are different, the performance of the
model will suffer. That said, since marketing research is concerned with finding potential
customers, the increase in accuracy to 91.9% for those likely to buy again is encouraging, and
despite the methodological caveat, an analyst may legitimately decide to set priors proportional to
the sample sizes. But if a conservative approach is to be taken, then leaving the priors equal for
all groups is the logical choice.

Validating the Discriminant Model


For any model, it is absolutely necessary that the analyst validate the results to determine how
well they will hold on new data. The general problem is that the sample is classified based on
parameters that are estimated from the sample itself. Thus the current model has been optimized
for this particular data and capitalizes on chance variation. The result is that the observed error
rate is probably too low and may not hold in the population.

SPSS provides two techniques to validate a model. The first is called “leave–one–out
classification.” It involves classifying each case into a group according to the classification
functions computed from all the data except the case being classified. The leave–one–out
classification will usually produce a lower level of predictive accuracy than the original model. It
is, however, still a bit too optimistic compared to how the discriminant function will perform on
completely new data.

To provide for a truer validation, SPSS allows you to limit the initial analysis to a subset of cases
defined by the value of a variable. In the language of validation, the initial sample is called the
training data and the second sample is called the holdout or test data. Typically, you randomly
split the data file into two parts (reasonable relative sizes might be 70% of the cases in the
training data and the remaining 30% in the test data) by creating a variable with two values. SPSS
then uses that variable to define the two datasets. This method is to be preferred to leave–one–out
classification, if enough data are available in the initial sample.

We’ll try leave–one–out classification, leaving the prior probabilities set to sample sizes.

Click the Dialog Recall tool , click Discriminant Analysis


Click Classify
Click on Leave–one–out classification
Click Continue, click OK


Figure 2.25 Classification Table with Cross–Validation

The percentage of correctly classified cases has decreased only slightly, from 81.1% to 80.6% for
cross-validation. The ungrouped cases are not used in the validation. Moreover, the accuracy for
those likely to buy a VCR has stayed the same, which is encouraging. Typically, the drop in
overall accuracy is larger than the 0.5% observed here. This validation provides further
confidence in the model that we have developed.

To use the other form of validation, we need to create random subsets from the data. This syntax
can be used to create a grouping variable, randvar, which splits the file into two groups of
customers, about 70% for the training data and 30% for the test data. We’ve set the seed value
here to ensure that the same random sample is created.

* Fix the random seed so the same split is produced on each run.
SET SEED 1234567.
* UNIFORM(10) returns a random value between 0 and 10 for each case.
COMPUTE RANDVAR=UNIFORM(10).
* About 70% of cases become group 1 (training) and 30% group 2 (test).
* RECODE uses the first matching range, so a value of exactly 7 is coded 1.
RECODE RANDVAR (0 THRU 7=1) (7 THRU 10=2).

If we ask for a subset analysis in Discriminant (not shown), the resulting classification table is
displayed below (you specify randvar as the selection variable with a value of 1).


Figure 2.26 Classification Table for Subset Validation

Before looking at the results, note that the original sample here is a bit too small for reliable
results in this form of validation. We see that the percentage of correct predictions on the training
data is 85.1%, even higher than the 81.1% on the full file. The percent of correct classifications
for those likely to buy again has gone up to the rarefied value of 96.4%. The accuracy for
unselected cases in the test data (the holdout sample), which only included 122 cases that have a
value for buyyes, has decreased to 76.2%. This is a much greater drop than for the leave–one–out
method, although the drop for those likely to buy again is only about 4.5%.

Most analysts would consider these results to be quite acceptable. That is, the level of accuracy
has not dropped too severely. After validating the data with this method, the percentages in the
Cases Not Selected table should be used for planning purposes as indicators of likely success on
future data.
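
For reference, the training/test selection itself is made in syntax with the SELECT subcommand; a hedged sketch (predictor list again abbreviated) follows. Cases with other values of randvar are classified but are not used to estimate the functions.

* Sketch: estimate the functions on the training cases only (randvar = 1).
DISCRIMINANT
  /GROUPS=buyyes(1 2)
  /VARIABLES=pinnovat age
  /SELECT=randvar(1)
  /CLASSIFY=NONMISSING POOLED.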

Stepwise Model Selection


As with regression, Discriminant provides a choice to create a model via stepwise selection
methods. The default method is to select the variable, at each step, that minimizes the overall
Wilks’ lambda. It is always tempting to use stepwise methods to assist in variable selection, but
some authorities counsel against it (see, for example, Huberty, 1994). The reasons are as follows:

1. The selection of variables is optimized for the current sample. That is, it capitalizes on
chance variation and correlation in this data that may not exist in future data. Since the
whole intent of PDA is to predict group membership in future datasets, this is a serious
pitfall. Of course, applying the results of the stepwise model to a validation data sample
would address this concern.
2. The stepwise methods minimize a criterion other than predictive accuracy, such as Wilks’
lambda or the Mahalanobis distance. Since the intent of the analysis is predictive
accuracy, this again is not a good thing. On the other hand, it is reasonable to expect that
variables chosen using this criterion will aid prediction.


No stepwise analysis will be run here, although this is suggested as an exercise. When a stepwise
model is run, a considerable amount of additional output reports on the variable selected at each
step, but the remainder of the output is identical to what we have already reviewed. Thus, if you
are comfortable doing stepwise regression, you will be comfortable doing a stepwise discriminant
analysis.

Three–Group Discriminant Analysis


We conclude our study of Discriminant with a brief look at a group variable with three categories.
First, let’s open a new dataset, GSS93.sav, a subset of the 1993 General Social Survey from the
National Opinion Research Center.

Click File…Open…Data, move to the c:\Train\Advstat folder (if necessary)


Double-click on Gss93.sav

We’ve created a variable in this file called smkdrnk with the following definition:

Category Definition
0 Neither smoke nor have drunk too much
1 Either smoke or have drunk too much
2 Both smoke and have drunk too much

We want to use various demographic variables to predict in which category each respondent is
located. An HMO might use such a discriminant model to know who should be targeted for
preventive health care programs.

Analysis with a grouping variable that has three or more categories differs from the previous
analyses mainly in that more than one discriminant function will be created. This complicates the
interpretation a bit, but the principles and the type of output are essentially the same.

Click Analyze…Classify…Discriminant
Move smkdrnk into the Grouping Variable box
Click Define Range and type 0 and 2 in the Minimum and Maximum boxes
Click Continue
Move age, attend, black, class, educ, sex, white into the Independents list
Click the Statistics button and select Fisher’s and Box’s M
Click Continue, click the Classify button
Click Summary table, Combined-groups and Territorial map
Click Continue, click OK

The variables black and white are dummy variables for those racial groups. The variable attend
measures how often someone attends religious services (are those who attend more often less
prone to smoking and drinking?).
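
If you prefer to work from syntax, the dialog choices above correspond roughly to the sketch below. The subcommand keywords reflect the usual DISCRIMINANT conventions but should be verified by pasting the syntax from the dialogs.

* Rough syntax equivalent of the three-group run; verify keywords via Paste.
DISCRIMINANT
  /GROUPS=smkdrnk(0 2)
  /VARIABLES=age attend black class educ sex white
  /STATISTICS=BOXM COEFF TABLE
  /PLOT=MAP COMBINED.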

Scroll to the pivot table labeled Eigenvalues

There is one line in this table for each discriminant function. There are two functions because
there are three groups in smkdrnk. The largest eigenvalue (.160) and the first function correspond
to the direction of maximum dispersion of the group means in the multidimensional space of the
predictors. The second function, corresponding to the second eigenvalue, is
orthogonal to the first. The first function is clearly the most important because it explains 97.0%
of the variance. However, the canonical correlation is only .371, meaning that the correlation
between the first function scores and group contrasts on the variable smkdrnk is rather small. It
would appear that only one function is necessary to separate the categories.

Figure 2.27 Eigenvalues and Wilks’ Lambda

In the bottom table, Wilks’ lambda for a single function is, again, one minus the square of that
function’s canonical correlation; for a set of functions it is the product of those values. The first
line, labeled “1 through 2,” is an overall test of whether the means, or centroids, of both functions
are equal in the three categories of smkdrnk. The null hypothesis is no
difference, which is rejected given the low significance value. Each successive test in a Wilks’
lambda table tests whether the additional functions, after removing the effect of previous
functions, reflect population differences or just random variation. Thus, when the first test is
significant, it means that there are significant differences on one or more of the discriminant
functions. The second test shows that there are no significant differences on the second function
by itself, and we conclude that only the first function shows significant group differences. We can
safely ignore results based on the second function in the structure matrix and standardized
coefficients tables.

Scroll down to the Structure Matrix table

We know that the second function can be ignored here, so the importance of the various variables
can be assessed from function 1. It would seem that the variable that best separates the groups is
frequency of attendance at religious services, followed by education, age, and class identification.
Even though the second function is non-significant, notice that education has a relatively small
correlation with this function even though it had a larger coefficient on the first function.


Figure 2.28 Structure Matrix

If you examine the standardized coefficients table and the centroids table nearby in the Output
Pane (not shown), you can see that higher scores on the variables are associated with lower values
of smkdrnk. That is, those who more frequently attend religious services, are older, have more
education, and are of a higher social class are all less likely to smoke and drink.

Scroll down to the Classification Function Coefficients table

The Fisher coefficients are used to classify a new case just as we did for the two-group problem.
The function with the highest score is the one to which a case is assigned. As you can see in
Figure 2.29, the only difference here is that there are three functions, one for each category of
smkdrnk.

Figure 2.29 Fisher Classification Functions

Scroll down to the Territorial Map

This is a scatterplot, in the space of the two canonical discriminant functions, of the centroids for
the three categories of smkdrnk. The centroid of each category is marked with an asterisk. The
plot is divided into three regions representing the areas in which a case is predicted to be in one of
the three categories. For example, the region on the right, bordered by “1,” represents cases
predicted to be in category 0 of smkdrnk (see the legend at the bottom of Figure 2.30), that is,
people who neither smoke nor drink.

Note that the centroids are not well separated, even on the first function. This indicates that it is
relatively difficult to predict group membership. The first function is clearly the more important
one: there is little difference among the three centroids on the second function, and the prediction
regions are separated primarily along the function 1 direction rather than the function 2 direction.

Figure 2.30 Territorial Map

Just below the territorial map is a similar high–resolution scatterplot that places all the
respondents in the same discriminant function space, with markers to indicate group membership
(called the “Canonical Discriminant Functions” plot). This graph is more easily viewed in color,
so it is not shown here. What you can learn from it is how poorly separated the cases are from the
three categories of smkdrnk even on function 1. The cases all cluster together near the center of
the plot, although there are some outliers, especially at negative values of function 2. It is clear
that prediction of group membership for smkdrnk is not very effective with our current set of
predictors. To see this quantified:

Scroll down to the Classification Results table


Figure 2.31 Classification Results for Smkdrnk

Our results are poor, as only 46.5% of the cases were correctly classified. We did best for those
who neither smoke nor drink too much and those who both smoke and have drunk too much.
Those in the middle category had a correct prediction in only 26.4% of the cases. By chance we
either could have randomly assigned a case to one of the three categories, with an accuracy of
33.3% overall, or we could have chosen the mode for smkdrnk (category 0), which would have
yielded a chance accuracy of 353/(353 + 299 + 103) = 46.7%. With this as the definition of chance,
our model actually performs a bit more poorly, even though the first discriminant function was
significant. This should emphasize the point that significance in discriminant analysis is not the
same as predictive accuracy.

Clearly, it’s back to the drawing board to predict categories of smkdrnk, but for us, it’s the end of
our study of discriminant analysis.


Chapter 3
Binary Logistic Regression

Topics
• Introduction
• How Does Logistic Regression Work
• The Elements of Logistic Regression
• Assumptions of Logistic Regression
• A First Example
• Interpreting Logistic Regression Coefficients
• Predictive Accuracy
• Residual Analysis
• Stepwise Logistic Regression
• ROC Curves
• Appendix: Comparison to Discriminant Analysis

Introduction
Many situations in data analysis involve predicting the value of a categorical outcome variable.
These include applications in medicine predicting the health status of a patient, in marketing
research predicting whether a person will buy a product, or in schools predicting the success of a
student. Logistic regression is a technique that can be very helpful in these, and many other,
situations.

Logistic regression is designed to use a mix of continuous and categorical predictor variables to
predict a categorical dependent variable. It is often seen as an alternative to discriminant analysis.
Many of the concepts we discussed in the chapter on discriminant analysis apply to logistic
regression, such as the classification table, the predicted category membership of each case, the
probability of membership, and an ordering of the relative importance or impact of the predictor
variables.

Unlike discriminant analysis, logistic regression requires fewer, and less stringent, assumptions.
In addition, even when the assumptions of discriminant analysis are satisfied, logistic regression
performs almost equally well.

In this chapter we will discuss logistic regression where the dependent variable has two categories
(binary logistic) and we will cover the general case (more than two outcome categories) in
Chapter 4 (multinomial logistic regression).


How Logistic Regression Works


Multiple regression cannot be used to predict the values of a dichotomous dependent variable for
two reasons. First, when we try to predict the values of a variable coded say, 0 or 1, we can
consider the predicted values to be probabilities, that is, the probability of obtaining a predicted
value of 1. In multiple regression with a straight-line fit to the data, it is often the case that values
less than 0 or greater than 1 are predicted. That is clearly a problem.

Second, one of the key assumptions of regression is homogeneity of variance. However, for a
dichotomous variable, the mean and the standard deviation are related because the standard
deviation is [(p)(1 – p)]^(1/2), where p is the mean of the variable. Since there is a functional
relationship between the standard deviation and the mean, homogeneity of variance across values
of the dependent variable cannot be satisfied.

Logistic regression was developed in the 1960s as one solution to this problem. When predicting
the value of a variable that varies on a scale from 0 to 1, it makes sense to fit an S-shaped curve to
the data, as depicted in Figure 3.1. Imagine that we are trying to predict the likelihood, or
probability, of someone buying a home, using income as the predictor. At very low levels of
income, or for small values of the variable X in Figure 3.1, the likelihood of buying a home is
very small. As income rises, more and more people are able to afford a home, and the rate of
increase in the probability, per unit of income, also rises. Then, at high levels of income, most
people can afford a home, and since the likelihood must be less than or equal to 1, the curve
levels off, so that the rate of increase of probability per unit of income decreases. The result is an
S-shaped, or logistic curve, that is also used in biology and economics to model many
relationships.

Figure 3.1 Logistic Curve

The logistic function is bounded at zero and one, so impossible predictions cannot occur. There is
actually a whole family of S-shaped functions, the probit being the other well-known variant. Due
to various considerations, most statisticians eventually agreed upon the logistic as the model of
choice for regression with a dichotomous dependent measure.


The Logistic Equation


Binary logistic regression (which will be referred to simply as logistic regression) is regression
applied to a dichotomous dependent variable, where the dependent variable is not the raw data
values, but instead is the odds of the event of interest occurring. Specifically, the general equation
for logistic regression is:

ln(Odds) = α + B1X1 + B2X2 + ... + BkXk

where the terms on the right are the standard terms for the independent variables and the intercept
in a regression equation. However, on the left-hand side is the natural log of the odds, and the
quantity ln(Odds) is called a logit. It can vary in principle from minus to plus infinity, thus
removing the problem of predicting outside the bounds of the dependent variable. The odds are
related to the probability by:

Odds = Prob / (1 − Prob)

Note that there is a linear relationship with the independent variables in logistic regression, but it
is linear in the log odds and not in the original probabilities. Since we are interested in the
probability of an event, i.e., the higher code in a dichotomous variable, the logistic equation can
be transformed into an equation in the probability. It then has this form:

prob(event) = 1 / (1 + e^−(α + B1X1 + B2X2 + ... + BkXk))

This equation cannot be estimated with the least–squares method; instead, the parameters of the
model are estimated using a maximum–likelihood technique. We derive coefficients that make
our observed values most “likely” for the given set of independent variables. This must be done
through iteration by SPSS.
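
The back-and-forth among probabilities, odds, and logits can be made concrete with a few transformation-language lines. This is only an illustrative sketch; prob stands for any probability variable already in a data file.

* Illustrative conversions among probability, odds, and logit.
COMPUTE odds = prob / (1 - prob).
COMPUTE logit = LN(odds).
* Reversing the transformation recovers the original probability.
COMPUTE prob2 = 1 / (1 + EXP(-logit)).
EXECUTE.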

Elements of Logistic Regression Analysis


There are two general goals when doing logistic regression:

1. Determine the effect of a set of variables on the probability, plus the effect of individual
variables
2. Attain the highest predictive accuracy possible with a given set of predictors

These two goals are not mutually exclusive, but one or the other tends to be the focus of an
analysis. Those interested in theory and causal effects typically are more concerned with the first
goal; those concerned with predicting whether a future event will fall in one or the other category
of the dependent variable focus on the second goal.

Many steps in doing logistic regression are similar to standard regression. You must first select a
reasonable set of predictors, and you must examine the data closely beforehand to look for
unusual patterns, outliers, missing data problems, and so forth. After estimating the equation and
examining the effect of individual variables, you should do a few checks to see whether the data
meet the assumptions of the logistic model and look for cases that have undue influence on the
results. If you are interested in predicting the category membership on future data then it is very
important that the model be validated. This means deriving a model on a subset of the data and
then testing it on the holdout sample.

Of course, there are differences from the standard Ordinary Least Squares (OLS) multiple
regression. There is more than one way to measure the fit of the model, and more than one way to
measure the amount of explained variance. Since the mean and variance are related, it was
initially difficult to derive a reasonable equivalent of the R2 in multiple regression, but now SPSS
includes pseudo R2 statistics.

As we saw for discriminant analysis, the goodness of fit or significance of a model does not
necessarily equate to high predictive accuracy. As sample size increases, a set of independent
variables can be statistically significant, but still not yield a high percentage of correct
predictions.

The classification of cases is a simple matter in binary logistic regression. A case is predicted to
be in the lower value of the dependent variable if its predicted probability is less than 0.50;
otherwise, it is predicted to be in the upper category.
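
Once the predicted probability has been saved to the data file (as we do below with the Save dialog), the rule amounts to a single comparison. In this sketch the name pre_1 is assumed for the saved probability; check the actual name SPSS assigns in your file.

* Sketch of the default classification rule with a .50 cutoff.
COMPUTE predcat = 0.
IF (pre_1 >= 0.5) predcat = 1.
EXECUTE.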

Assumptions of Logistic Regression


Logistic regression requires fewer assumptions than OLS regression. Logistic requires that:

1. The independent variables be interval, ratio, or dichotomous;


2. All relevant predictors be included, no irrelevant predictors be included, and the form of
the relationship be linear;
3. The expected value of the error term is zero;
4. There is no autocorrelation;
5. There is no correlation between the error and the independent variables;
6. There is an absence of perfect multicollinearity between the independent variables.

The next two assumptions are made in OLS regression, but not logistic regression:

1. Normality of errors: the errors are assumed to follow a binomial distribution, which only
approximates a normal distribution for large samples.
2. Homogeneity of variance: as we discussed above, this condition cannot hold by
definition.

Unlike discriminant analysis, the use of a large number of dummy variables as predictors does not
violate any assumption of logistic regression, which is why it is often preferred in those
situations.

We will return to the topic of assumptions after running a logistic regression.

One more point should be mentioned. All things being equal, logistic regression requires larger
sample sizes than OLS regression for correct inference. Although authorities disagree on exactly
how large, a reasonable rule of thumb is to have at least 30 times as many cases as parameters
being estimated in the model.


Logistic Regression Example: Low Birth Weight


We will use a data set from Hosmer and Lemeshow (2000) in our first example. The data concern
the prediction of low birth weight babies based on characteristics of the mother. The dependent
variable is LOW (Low Birth Weight: 1 = less than 2500 grams, 0 = over 2500 grams). Possible
predictor variables are:

lwt Weight at last menstrual cycle


age Age in years
smoke Smoking status during pregnancy, where 0 = no and 1 = yes
ptl History of premature labor, coded with the number of premature deliveries
ht History of hypertension, where 0 = no and 1 = yes
ui Presence of uterine irritability, where 0 = no and 1 = yes
ftv Number of physician visits during the first trimester
race Where 1 = white, 2 = black, and 3 = other

The exact birth weight in grams (bwt) is also included. Data are available on 189 births. Logistic
regression is used quite often in clinical settings for problems of this nature.

The SPSS data file is named Birthwt.sav in the c:\Train\Advstat directory.

Click File…Open…Data
Move to the C:\Train\Advstat directory (if necessary)
Double-click on Birthwt

We will not spend any time here examining the independent variables, but be sure you do so
before running any logistic regression. We turn instead to how to request a binary logistic
regression in SPSS.

Click Analyze…Regression…Binary Logistic

The minimum specification for the procedure is a dependent variable and one independent
variable, or “Covariate,” as SPSS labels the predictors.

Move low into the Dependent list box


Move age, ftv, ht, lwt, ptl, race, smoke, ui into the Covariates list box

As with regression, the model can be created using all independent variables together or, instead,
by using a stepwise method of selection. In this example we will use the default choice of all
predictors entered together.


Figure 3.2 Binary Logistic Regression Dialog

By default, SPSS assumes that the independent variables are interval-ratio (or dichotomous,
which can be viewed as a special case of interval scale). That is why they are labeled covariates.
Unlike the SPSS Regression or Discriminant procedures, SPSS Binary Logistic Regression has
the ability to automatically create dummy variables to include nominal variables in an analysis.
Logistic Regression can use string variables as predictors because of this feature.

Click the Categorical button


Move race into the Categorical Covariates box

In the resulting dialog box, all the variables will initially appear in the list of Covariates (except
for string variables, which automatically appear as Categorical Covariates). The variable race is a
nominal variable with three categories. We must inform SPSS of that fact and, if desired, choose
a coding scheme.

Figure 3.3 Completed Define Categorical Variables Box

Although many analysts are familiar with what is called “dummy” coding for categorical
variables, seven options are available in SPSS to code categorical variables. More technically,
these are all forms of contrast coding. The default choice of indicator coding will provide
estimates of the effect of a category of the independent variable on the dependent variable,
compared to a reference category of the independent variable. This is, in other words, the usual
form of dummy coding. Notice that the reference category by default is the last category. This is
fine for the variable race since the last category is “other.”

Click Continue,
Click the Save button
Click the Probabilities and Group membership check boxes

In the Regression procedure several plots of residual statistics can be requested. The same plots
are not available directly in logistic regression, but they can be created after saving various
residuals and influence measures. Both the predicted probabilities and the predicted group
membership can be saved. We will save both.

Figure 3.4 Save New Variables Dialog

There are two types of statistics that can be saved. Influence measures help you determine how
influential a case is on the overall regression results. Residual statistics help to determine whether
a case is an outlier. Graphical techniques can be used with all of these to find cases that deserve
more study. Also, the model can be saved in XML format, as with Discriminant.

Of the influence measures, Cook’s statistic tells how much the residuals of all the cases would
change if this particular case were excluded. The leverage values provide a general idea of the
potential influence of each case on the model’s fit. And the DfBeta statistics provide information
on how much each case influences the regression coefficients themselves. No one of these can
necessarily be recommended over any other, and it is often helpful to look at more than one. The
Advanced Techniques: Regression course examines some of these in the context of linear
regression. We will return to this issue in our next analysis.

The logit residual is the error in predicting the log odds. The unstandardized residual is the error
in predicting the probability. The standardized residual adjusts each residual by the standard error
in predicting the probability. The other two residuals are more technical. Again, we will request
these in a later analysis.


Click Continue
Click the Options button
Click the Classification plots and Hosmer-Lemeshow goodness-of-fit check boxes

Figure 3.5 Options Dialog

The Options dialog box has a mix of choices. Under Statistics and Plots you can and should
request a classification plot. The Hosmer-Lemeshow goodness-of-fit tests the goodness-of-fit of
the model using the observed and predicted number of events. The casewise listing of residuals is
equivalent to the same output from the Regression procedure. If you plan to use the logistic
regression equation to make predictions, it can be helpful to calculate confidence intervals for the
regression coefficient estimates; this option is available with the CI for exp(B) choice.

If you request a stepwise analysis, the probabilities for entry and removal can be modified from
their default values. An interesting option is the classification cutoff, which is set to .5 by default.
This is the cutoff value to which the predicted probability for each case is compared when SPSS
assigns cases to predicted groups. You can change this value to improve predictive accuracy, but
you should only do so if you are very experienced. For assistance in this, see the ROC plot later
in this chapter.

Click Continue, and then click OK
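
For reference, the dialog choices above correspond approximately to the syntax below. This is a sketch from memory of the LOGISTIC REGRESSION command; the exact subcommand keywords are best confirmed by clicking Paste instead of OK.

* Approximate syntax equivalent of the dialog settings; confirm with Paste.
LOGISTIC REGRESSION VARIABLES low
  /METHOD=ENTER age ftv ht lwt ptl race smoke ui
  /CONTRAST (race)=Indicator
  /SAVE=PRED PGROUP
  /CLASSPLOT
  /PRINT=GOODFIT.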

Logistic first displays a case processing summary indicating how many cases are used in the
analysis—all the cases, or 189, in this instance. It also provides information on whether it had to
temporarily recode the outcome variable so as to conform to a 0,1 coding (not required here since
low is coded 0,1). Logistic also provides the parameter coding for the dummy variables.


Figure 3.6 Initial Summary Output

SPSS next provides some baseline results under the title “Block 0: Beginning Block.” These are
based on a logistic model containing only an intercept (constant). Although this model is not of
interest, some useful baseline information is obtained. First, the Classification table (not shown,
but will be discussed shortly) indicates that a model always predicting the most common outcome
category (normal birth weight) is correct for 68.8% of the sample. It correctly predicts all the
normal birth weight babies, but misses all of the low birth weight babies. This provides a baseline
against which to evaluate later models.

Additional tables present information about variables in the equation (only the constant) and
variables not in the equation. This latter summary indicates which variable is likely to be entered
next when using stepwise methods and will be discussed later in the chapter.

SPSS indicates which variables were entered in this analysis and displays statistics on the overall
fit of the model. There is some redundancy here when all the independent variables are entered in
one step through forced entry. The probability of the observed results, given the parameter
estimates, is known as the likelihood. It is customary to use –2 times the natural log of the
likelihood (-2LL) as a measure of model fit since it has ties to the chi-square distribution. A good
model that has a high likelihood translates to a small value for –2LL. For a perfect fit, -2LL
would be equal to zero.

Scroll to the Block 1: Method Enter results

Figure 3.7 Model Fit and Test Statistics


The Model chi-square is a statistical test of the null hypothesis that the coefficients for all of the
terms in the model are zero. It is equivalent to the overall “F” test in regression. Its value, 31.118,
is simply the difference between the initial (not displayed, but based on the model containing only
the constant) and final -2LL. It has 9 degrees of freedom, which is the difference between the
number of parameters in the two models. We reject the null hypothesis because the significance is
so low, .000 (to 3 decimals), and conclude that the set of variables improves the prediction of the
log odds.

The other two tests, Block and Step, have the same value as the Model statistic because we
entered only one block of variables and did not do a stepwise model selection. If we had, they
would display the results of the last block or step.

With all the variables in the model, the goodness-of-fit -2LL statistic is 203.554, shown in the
Model Summary table in Figure 3.7. This fit statistic is not usually interpreted directly, but the
model, block and step chi-square values are based on changes in the -2LL value.

Logistic regression also provides two measures that are analogs to R2 in OLS regression. Because
of the relationship between the mean and standard deviation for a dichotomous variable, the
amount of variance explained by the model must be defined differently. The Cox and Snell
pseudo R2 is .152 and the Nagelkerke pseudo R2 is .213. Usually the Nagelkerke pseudo R2 is to
be preferred because it can, unlike the Cox and Snell R2, achieve a maximum value of one. By
either measure, the independent variables can only explain a modest amount of the variance.

Next in the output are the Hosmer and Lemeshow goodness-of-fit test and table summaries,
shown in Figure 3.8. The test statistic is calculated by dividing the cases into ten approximately
equal sized groups based on the estimated probabilities, then comparing the observed to the
expected, or predicted, number of observations in each category of the dependent variable. The
goodness-of-fit statistic is 11.323, distributed as a chi-square value, with significance of .184.
When comparing observed and expected events in the context of testing goodness-of-fit, you
hope to find a non-significant probability, which indicates that the expected and observed events
are close, in turn implying that the model is a good fit. Here, the model does appear to fit,
confirming the change in –2LL test (test of model).


Figure 3.8 Hosmer and Lemeshow Goodness of Fit Test and Table

There are two potential problems with the Hosmer and Lemeshow statistic. First, reasonably large
sample sizes are necessary so that the expected number of events in most groups is 5 or greater.
You can see from the second Expected column that this assumption is violated in the birth weight
data. Conversely, with a large sample size, it is easier to reject the null hypothesis of no
difference since the value of a chi-square statistic is proportional to sample size.

Since the model is significant and passes the goodness-of-fit test, we turn to the classification
table.

Accuracy of Prediction
A measure of how well the model performs is in its ability to accurately classify cases into the
two categories of the variable low (whether or not the baby had a low birth weight).

The overall predictive accuracy is 73.5%, shown in Figure 3.9. We are doing much better for
babies of higher birth weight, as the model correctly predicted 119/130, or 91.5% of these cases.
It does a relatively poor job for predicting low birth weight babies, only getting 20/59, or 33.9%
correct. In a clinical setting interest would undoubtedly be in the low birth weight babies, so the
current model would certainly not be acceptable. This illustrates the lack of correspondence
between statistical fit of the model from likelihood statistics, or the significance of individual
variables, and the predictive ability of the model. Finding a significant model does not mean
having high predictability (as with Discriminant).


Figure 3.9 Classification Table

How much has the prediction been improved over chance? There are two ways to answer this
question. If we know nothing about the independent variables, just the distribution of the
dependent variable, then we could predict all cases fall into the mode (high birth weight). We
would be correct 130/189 or 68.8% of the time, so the improvement to 73.5% seems much less
striking. Another way to answer the question is to require the distribution of birth weights to
remain the same, i.e., that there be 130 cases in the first category and 59 in the second. In that
case, the baseline number of errors would be, from Menard (1995):
Errors = Σi fi(N – fi)/N, where N is the number of cases and fi is the number of cases in category i.
Using this formula, the number of errors would be about 81, so the baseline accuracy would be
(189 – 81)/189 = 57.1%. The model’s accuracy of 73.5% is more of an improvement from this
figure, but it still is not acceptable. Next we turn to the effects of individual predictor variables.
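
Before doing so, here is a quick transformation-language check of the Menard baseline; the counts 130 and 59 come directly from the classification table.

* Worked check of the proportional-chance baseline (Menard, 1995).
COMPUTE errors = 130*(189 - 130)/189 + 59*(189 - 59)/189.
COMPUTE baseline = (189 - errors)/189.
* errors is about 81.2, so the baseline accuracy is about .571.
EXECUTE.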

Interpreting Logistic Regression Coefficients


We turn to the “Variables in the Equation” table shown in Figure 3.10. This output should look
somewhat similar to the output from OLS regression. Each variable is listed, plus a constant. The
table displays the B (unstandardized) coefficients and their standard errors; a test of significance
based on the Wald statistic; and the Exp(B) column, which is the exponentiated value of the B
coefficient. As with OLS regression, these coefficients are interpreted as estimates for the effect
of a particular variable, controlling for the other variables in the equation.

First, remember that the original model is in terms of the log of the odds ratio, or logit. Therefore,
the B coefficient is the effect of a one-unit change in an independent variable on the log odds. For
the history of hypertension (ht), the effect is to increase the log odds by 1.763, since the variable
is either coded 0 or 1. Thus, we can simply state that hypertension in the mother increases the log
odds by 1.763. But what does this actually mean in terms of the probabilities? To get at this more
intuitive interpretation, the Exp(B) column presents the exponentiated value of B. For the
hypertension variable, this value is 5.831, which is equivalent to e^1.763. This value is now
expressed in terms of the odds ratio, so if the mother has hypertension, we estimate that the odds
of her having a low birth weight baby increase by a factor of 5.831.


Figure 3.10 Summary for Variables in the Equation

The next question is: How much does this change the probability? The answer turns out to be “It
depends,” because it depends on where someone starts. If your original odds of having a low birth
weight baby were very small, say 1 to 100, increasing them by a factor of 5.83 increases them
only to 5.83 to 100, still not very high. But if your initial odds are 1 to 1 (or a 50 percent chance),
the odds increase to 5.83 to 1. Unless you are a regular gambler, though, it is probably easier to
express these numbers in terms of probabilities. Here is how it is done.

If the odds of a low birth weight baby are initially 1 to 100, then the probability is related to the
odds by:
prob = odds / (1 + odds)

so the probability is [(1/100) / (1 + (1/100))] or [.01 / (1 + .01)] = .0099, or about .01.
Increasing the odds by a factor of 5.83 changes the probability to [.0583/(1+.0583)] =.055. This is
not a very striking change in probability. But if the initial odds were 1 to 1, the initial probability
is [1/(1+1)]=.5. Increasing the odds by 5.83 then increases the probability to [5.83/(1+5.83)] or
about .85, a substantial change.
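If you want to carry out this arithmetic in SPSS, a minimal sketch follows; the variable names p_from_low and p_from_even are hypothetical, created only to hold the two results.

* Odds of 1 to 100, increased by a factor of 5.83, converted to a probability (about .055).
COMPUTE p_from_low = (5.83 * (1/100)) / (1 + 5.83 * (1/100)).
* Odds of 1 to 1 (even odds), increased by a factor of 5.83, converted to a probability (about .85).
COMPUTE p_from_even = (5.83 * 1) / (1 + 5.83 * 1).
EXECUTE.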

The fact that the logistic regression is nonlinear in the probability (refer to Figure 3.1) is readily
apparent. The same change in the odds ratio, 5.83, changes the probability by a varying amount
depending upon the initial probability for a case.

It is exactly with the exponentiated value of the logistic regression coefficient that the results of
many medical studies are expressed. The results of a study on exercise might state that engaging
in vigorous physical exercise three times a week reduces the odds of having a heart attack by half.
In that instance, the value of the exponentiated coefficient would be .5, acting so that the odds
ratio is reduced by that amount.

Looking at the significance values we see that hypertension (ht), weight of the mother (lwt), and
smoking (smoke) are significant predictors of a low birth weight baby. The categorical variable
race appears three times. First, an overall effect for race is printed, which is significant (p=.028).
This is a very handy summary statistic that is not available in the Regression procedure when
using dummy variables. Then the effect of each of the dummy variables is listed. These should
only be tested if the overall effect of race is significant. For this model, the effect of being white
(race=1) is significant, acting to decrease the odds ratio compared to those of the Other category
(race=3, the reference category). The effect of being black is nonsignificant, so there is no
difference in the effect on the log odds between blacks and other non-whites.


The significance values for the predictor variables are calculated using the Wald statistic, which
is simple to compute (the squared ratio of a coefficient to its standard error) and is distributed as a chi-square. When the absolute value of the
regression coefficient becomes large, the estimated standard error is too large, producing a Wald
statistic that is too small. This leads to failure to reject the null hypothesis when it is, in fact, false
(a Type II error). None of the coefficients in this particular model are too large (this is a judgment
call, in any case).

Making Predictions
Using the regression coefficients, we can easily make predictions about the values of individual
cases. Let us calculate the probability of having a low birth weight baby for a mother of age 20,
who weighed 130 pounds at her last menstrual period, who smokes, has no history of
hypertension or premature labor, does have uterine irritability, is white, and made two visits to
her doctor during the first trimester. The predicted probability can be calculated from:

prob(event) = 1 / (1 + e^−(α + B1X1 + B2X2 + ... + BkXk))
Then for this particular mother, substituting in the appropriate data, we find that her probability is
.397 (see below).

prob(event) = 1 / (1 + e^−(1.14 + .025*20 + .032*2 − .014*130 − .908 + .927 + .649)) = .397
This can be expressed in terms of odds using the equations discussed earlier. This predicted
probability means that SPSS would assign this mother to the group of mothers who would not
have low birth weight babies since the probability is less than .5.
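The same bookkeeping can be sketched in syntax; pred_odds and pred_group are hypothetical names used only for illustration.

* Convert the predicted probability from the worked example into odds (about .66 to 1).
COMPUTE pred_odds = .397 / (1 - .397).
* Apply the default cutoff: a probability below .5 is classified as a normal birth weight baby (0).
COMPUTE pred_group = (.397 >= .5).
EXECUTE.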

Estimated Probabilities
SPSS creates a low-resolution histogram of the estimated or predicted probabilities, with symbols
denoting the actual group membership. Scroll to the bottom of the output to see this graph. When
the model is successful at predicting group membership, the cases for which the event has
occurred (here, low birth weight babies) should be to the right of 0.5; the cases for which the
event has not occurred (high birth weight babies) should be to the left of 0.5. We can see that
most of the symbols labeled “N” (for weights greater than 2500 grams) are correctly to the left of
0.5, but many of the symbols labeled “L” (for weights less than or equal to 2500 grams) are also
to the left of 0.5. We have already observed this relationship in the classification table: we can
accurately predict which babies have high birth weights, but not those who have low birth
weights.


Figure 3.11 Histogram of Predicted Probabilities

Checking Classifications
It can also be helpful to examine the distribution of the probabilities closely to see how badly
cases are being mispredicted. It might be possible to increase accuracy by changing the cutoff to a
value other than 0.5. In this model, moving the cut point will not help. Even when changing the
cut point might help, when you apply the model to new data you can not be certain the
classification rule will behave the same. As a consequence, the model should be validated when
predictive accuracy is the focus of an analysis.

SPSS saves the predicted probability values into a new variable called PRE_1 and the predicted group memberships into a new variable called PGR_1.

Click the Goto Data tool


Scroll to the last column

If the value labels don’t appear for the predicted group variable PGR_1:


Click the Value Labels tool


Figure 3.12 Predicted Probabilities and Group Membership

It can be useful to isolate the cases that were mispredicted to study their characteristics. When the
predicted and actual values of low do not match, that is an error. This syntax will create a variable
that flags the cases with errors. It can be generated either through a dialog box or the syntax
editor.

COMPUTE PRED_ERR = 0.
IF LOW NE PGR_1 PRED_ERR=1.

Once the variable pred_err is created, it can be used in various exploratory analyses. For
example, Figure 3.13 shows the weight of the mother at the last menstrual period for cases with
and without an error of prediction. It is immediately apparent that, when the model makes an
error, the weight of the mother is lower compared to correctly classified cases. This might be a
clue as to how to improve the model in future analyses, or where attention should be focused in
future clinical studies.

Figure 3.13 Weight of Mother by Error in Prediction
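One way to produce a comparison along the lines of Figure 3.13 is the EXAMINE procedure; this is only a sketch, since the text does not say which procedure created the figure:

EXAMINE VARIABLES=lwt BY pred_err
  /PLOT=BOXPLOT
  /STATISTICS=DESCRIPTIVES.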


Residual Analysis
(For Those with Extra Time)
In any regression analysis, it is important to look at the residual and influence statistics to look for
influential or unusual observations or odd patterns in the data. We will do a bit of that here. We
will request two of the influence statistics, but not the DfBeta values because there is one for each
parameter in the model.

Click the Dialog Recall tool , and then click Logistic Regression
Click the Save button
Click Cook’s and Leverage values check boxes
Click Unstandardized, Logit, and Standardized check boxes
Click off the check boxes for Probabilities and Group Membership

Figure 3.14 Completed Save New Variables Dialog

Click Continue, then OK
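For reference, the dialog choices above should paste to syntax roughly like the sketch below; the predictor list is assumed from the model used throughout this chapter, and the exact subcommands SPSS generates may differ slightly.

LOGISTIC REGRESSION VARIABLES low
  /METHOD=ENTER age lwt race smoke ptl ht ui ftv
  /CONTRAST (race)=Indicator
  /SAVE=COOK LEVER RESID ZRESID LRESID
  /CLASSPLOT
  /PRINT=GOODFIT
  /CRITERIA=CUT(.5).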

The output from the procedure is identical to what we produced above because we did not change
the model specifications. SPSS saved five new variables to the file. To see them:

Switch to the Data Editor window (click the Goto Data tool )
Scroll to the last five columns


Figure 3.15 Data Editor with Newly Saved Influence and Residual Variables

The new variables, like the previous ones, end with “_1” because they are the first of this type
to be created. The Cook’s and leverage values are very small, which is normal. If a case has no
influence on the regression results, the value of these statistics would be zero. The variable
RES_1 is the residual of the probability of occurrence. The first case has a predicted probability
of .29804, so we predict a high birth weight baby. This was correct, so the actual value of low is
0, and the error of prediction is (0-.29804) = –.29804, as listed for the first case. ZRE_1 is the
standardized residual of the probabilities, and LRE_1 is the residual in terms of the logit.

You can calculate various summary statistics on these new variables, but graphic techniques are
often helpful in looking for patterns and unusual cases. Let us start by looking at the distribution
of the error, using the standardized values.

Click Graphs…Histogram
Move ZRE_1 into the Variable box
Click the Display normal curve check box
Click OK
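The equivalent graph syntax is short (a sketch of what the Histogram dialog pastes):

GRAPH
  /HISTOGRAM(NORMAL)=ZRE_1.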

The histogram of the standardized residuals looks nothing like what you might expect from OLS
regression. It is clearly not normally distributed; it is, in fact, bimodal. At least there are not too
many outliers, that is, cases with ZRE_1 (a z score) greater than 2 in absolute value. There are more outliers on the positive
end because we had trouble predicting low birth weight babies.


Figure 3.16 Histogram of Standardized Errors

The absence of cases near a value of zero is because very few predicted probabilities were near
either 0 or 1. This is not uncommon in logistic regressions, especially ones like this where the
pseudo R2 is modest and the predictive accuracy is not outstanding. This distribution of errors,
unlike in OLS regression, should not be of concern. Remember that the distribution of the errors
should be, by definition, binomial.

Let us next try to find any influential cases by creating a scatterplot. The dialog box is not shown.
To do this, you need an ID variable that identifies each case.

Click Graphs…Scatter/Dot then on the Define button


Move COO_1 into the Y-axis box
Move id into the X-axis box
Click OK
Double-click on the chart (to open a Chart Editor window)

You normally do not create a scatterplot with an ID variable, but the purpose of doing so is to
easily isolate cases with a large Cook’s value. Once the scatterplot is created, double-click on it to
open a Chart Editor window. You can see that most cases have Cook’s values very close to zero,
but several cases have much larger Cook’s values. You can use an editing tool to identify these
influential cases.
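The scatterplot itself can also be requested with graph syntax (a sketch of what the dialog pastes; the chart editing that follows is still done interactively):

GRAPH
  /SCATTERPLOT(BIVAR)=id WITH COO_1.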

Click the point id tool

The cursor changes to the same symbol. Now click on the point with the highest Cook’s value.


Figure 3.17 Scatterplot of Cook’s Statistic by ID with an Outlier Identified

This is case 94, which means that it is in row 94 in the Data View sheet of the Data Editor.

Close the Chart Editor window


Click the Goto Data tool
Scroll to row 94

This mother has some unusual characteristics, as shown in Figure 3.18. Her weight at the last
menstrual cycle was only 95 pounds. She has had three previous premature labors. But despite
this, her baby did not have a low birth weight (the baby’s weight was 3637 grams, or about 8
pounds). At this point, the usual decisions must be made about whether to delete this case from
the analysis, and theoretically, why we made a misprediction, i.e., what other factors are
important. But for our purposes, the key point is to know how to locate these cases in SPSS.

Figure 3.18 Identified Case in the Data Editor


Analysis Tip
Multicollinearity is just as much of a problem in logistic regression as in OLS regression. There
are no diagnostics available in the logistic regression procedure to check for multicollinearity, but
it is still possible to test for its existence. Since multicollinearity involves relationships between
independent variables, you can use the diagnostics in the Regression procedure. In other words,
run regression with the same set of independent variables and the dependent variable. Request
collinearity statistics in the Statistics dialog box, and ignore all the output except those statistics.
We did so for the current data and model and found no sign of multicollinearity (not shown here).
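A sketch of that workaround in syntax follows; the predictor list is assumed from the birth weight model, race is entered as a single variable here (adequate for a rough collinearity check), and only the collinearity statistics in the output would be examined.

REGRESSION
  /STATISTICS=COEFF COLLIN TOL
  /DEPENDENT=low
  /METHOD=ENTER age lwt race smoke ptl ht ui ftv.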

Stepwise Logistic Regression


It is quite common to use a stepwise method of model selection to find the best subset of
independent variables that are good predictors of the dependent variable. There are problems with
using the stepwise methods, especially if the goal of the analysis is predictive accuracy. The
stepwise algorithms find the subset of variables that maximize the likelihood, but this is not the
same as maximizing predictive accuracy. Moreover, the model selected will fit this particular
sample well, but there is no assurance it will do so in a new data set, or that the same model
would be selected in new data. Consequently, model validation is critical if you have enough data
when using stepwise methods of selection.

SPSS supplies several stepwise methods. There are three methods of forward stepwise, all using
the score statistic, an alternative to the Wald statistic, for variable selection. They differ in the
method they use for variable elimination, using either the Wald statistic, the change in the
likelihood, or the conditional statistic. The latter is less computationally intensive and might be
preferred in large samples. In addition, in large samples, all three statistics are equivalent when
the null hypothesis is true. The three backward stepwise methods use one of these three statistics
as a test for the elimination of a variable. Generally speaking, we would recommend use of the
likelihood-ratio change as the criterion.

We will use the birth weight data and create a model using a forward selection method.

Click the Dialog Recall tool , and then click Logistic Regression
Click the Save button and deselect all check boxes, then click Continue
Click the Method drop-down list and select the Forward:LR method
Click OK
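The pasted syntax for this run should look roughly like the following sketch (the predictor list is again assumed from the full model):

LOGISTIC REGRESSION VARIABLES low
  /METHOD=FSTEP(LR) age lwt race smoke ptl ht ui ftv
  /CONTRAST (race)=Indicator
  /CRITERIA=PIN(.05) POUT(.10) CUT(.5).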


Figure 3.19 Logistic Regression with Forward Stepwise Method

The first section of output (Block 0: Beginning Block, not shown) displays the results for the
model with only a constant, including the classification table. The score statistic is used to decide
on variable entry. The largest score statistic, and the smallest probability associated with entry, is
for the variable ptl, history of premature labor (see the Variables not in the Equation table for
Block 0, not shown). It will be entered first.

Similar to stepwise regression, the next section displays the results of each step in the stepwise
analysis as an identified area within the relevant pivot table. For example, the Classification and
“Variables in the Equation” tables will present summaries at each step.

Figure 3.20 Omnibus Tests of Model Coefficients Table

We see that the stepwise analysis, using forward selection with the likelihood ratio statistic, took
three steps. Thus we know that three predictors were selected, although this table does not
identify which predictor entered the equation at each step. The Step summary displays the
statistical significance of the predictor entered at that step, while the model and block summaries
display a test of the coefficients of the overall model at that point. Not surprisingly, the
coefficient at each step is significant, as is the model.


Figure 3.21 Stepwise Model Summary

As additional variables are entered into the model, the goodness-of-fit measure, the –2 log
likelihood statistic, decreases. The pseudo R-square measures increase as additional predictors
enter the model. The R-square measures in the final step (.088, .124) are lower than those found
in the full model (.152 and .213 in Figure 3.7), although all values are fairly small.

Figure 3.22 Hosmer and Lemeshow Test

The Hosmer and Lemeshow fit test is significant for the two-variable model (step 2), but not for
the three-variable model, which is a promising sign for the three-variable model. No result
appears for the first step because a predictor with few distinct values was entered and several of
its values occurred very infrequently, so the table used for the Hosmer and Lemeshow test is too
small (recall that the cases are placed into 10 groups, based on the predicted values, if possible).

Figure 3.23 Classification Table

The percentages of correct predictions are 67.7, 68.8 and 71.4%, for the one-, two- and three-
predictor models. The model with three variables has a slightly lower percentage of correct
predictions (71.4%) than the model in which all variables were entered (73.5%, see Figure 3.9).


Note also that the one- and two-variable models predict no more accurately overall than a model
containing only the constant (68.8%). However, unlike the constant model, they make correct
predictions for some of the low birth weight babies.

Figure 3.24 Variables in the Equation Table

The Variables in the Equation table details which predictor was entered at each step and presents
model coefficient estimates. We see that history of premature labor (ptl) entered into the equation
in step 1, followed by history of hypertension (ht) in step 2, and mother’s weight at last menstrual
cycle (lwt) in step 3.

Focusing on the results in step 3, the coefficients for the three variables in the equation are fairly
similar to those from the full model (Figure 3.10), especially for ht and lwt. But notice that the
significance value for ptl, .026 in the current model, was .109 in the model with forced entry. And
the variable smoke was significant at the .05 level in the full model, but was not chosen in the
forward selection model.

These are exactly the types of different results that can occur when using various forms of model
building. Neither model is necessarily “more correct” than the other, but a researcher would try to
reconcile the differences. As mentioned previously, stepwise methods of variable selection do not
ensure the highest proportion of correct predictions.

Also, a “Model if Term Removed” table (not shown) presents significance tests based on the
change in –2 log likelihood value when each predictor is separately dropped from the model. It
provides a way of verifying that predictors entered early in a stepwise analysis still contribute to
the model when additional predictors have been entered. Although the score statistic was used for
variable selection, the log likelihood statistic comes into use once a variable has been entered into
the equation. For example, in step 1, the change in –2 log likelihood is used to test for removal of
the variable ptl. It is the difference in –2LL between the model with the constant and the model
with ptl added. This statistic, with one degree of freedom, has a probability value below .10, the
criterion for removal, so ptl will not be removed. The same logic is applied to each variable in the
later steps.


Figure 3.25 Variables Not in the Equation (for Step 1)

The score statistic is used to decide on variable entry. The largest score statistic at step 1 (after
ptl was entered), and the smallest probability associated with entry, is for the variable ht (history
of hypertension). It will be entered next. Note that to see the same comparison before step 1, we
need only to turn to the “Variables not in the Equation” table for the Block 0 section of the
results. There we would see that ptl had the highest score statistic, and was entered in the first
step.

It might seem that race should be entered following step 1, but the second dummy variable for
race has a lower score value than that for ht.

ROC Curves
The classification table we examined in Figure 3.9 was useful in providing estimates for
different accuracy and error rates under the classification rule. The rule used was the following: if
the predicted probability of having a low birth weight baby is .5 or greater, then predict a low
birth weight baby. Under this rule we can compute the estimated probabilities for these outcomes:

Outcome Type Description


True Positive Low birth weight baby, correctly predicted
True Negative Normal birth weight baby, correctly predicted
False Positive Normal birth weight baby, incorrectly predicted
False Negative Low birth weight baby, incorrectly predicted

We calculate these estimated probabilities from our earlier classification table (Figure 3.9)
reproduced below.

Figure 3.26 Classification Table from Binary Logistic Regression Analysis


Outcome Basis Estimated Probability


True Positive (20/ (39 + 20)) .339
True Negative (119/ (119 + 11)) .915
False Positive (11/ (119 + 11)) .085
False Negative (39/ (39 + 20)) .661

In medical and other fields, the true-positive probability is called the sensitivity of the
classification rule and the true-negative probability is called the specificity. They are important
because they summarize how medical tests and classification rules perform.
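In terms of the counts in the classification table, these two quantities are:

Sensitivity = true positives / (true positives + false negatives) = 20 / (20 + 39) = .339
Specificity = true negatives / (true negatives + false positives) = 119 / (119 + 11) = .915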

It would be interesting to see how these probabilities vary as the classification (decision) rule
changes. For example, in the context of binary logistic regression, if we used a cutoff other than
.5, how would that influence the hit and error rates? These summaries can be obtained if we know
the predicted probability (or predicted score) of the target event (here a low birth weight baby)
and the actual outcome. The results can be displayed in tabular or graphical form using the ROC
(Receiver Operating Characteristic) Curve procedure.

Basically, the procedure sorts the data by the predicted probability (score) of the event of interest,
then calculates the sensitivity and specificity at each predicted probability (score) change. You
can then examine the graph or table to evaluate the effects of using different cutoffs and the
tradeoffs of the different types of hits and errors. Examination of such a table or plot allows you
to understand the ramifications of changing the cutoff and aids you in evaluating whether a cutoff
different than .5 (the default) better meets your needs.

To illustrate, we will produce a ROC curve using the predicted probabilities from our first binary
logistic model (PRE_1).

Click Graphs…ROC Curve


Move pre_1 into the Test Variable box
Move low into the State Variable box
Type 1 in the Value of State Variable text box
Click With diagonal reference line check box


Figure 3.27 ROC Curve Dialog Box

In the context of logistic regression the Test variable would be the predicted probability of
belonging to the target category, although it could be any score on which a binary prediction is
based. The State variable represents the actual outcome against which the prediction (Test
variable) is compared. The supplied value for the State variable identifies the data value
associated with the target category (here 1, which corresponds to a low birth weight baby).

In addition to the plot, we will request that a table be created as well. This is done so we can more
easily link the plot to the classification table viewed earlier, and in practice this table would not
usually be viewed.

Click Coordinate points of the ROC Curve check box


Click OK
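These selections correspond to syntax along these lines (a sketch of what the ROC dialog pastes):

ROC pre_1 BY low (1)
  /PLOT=CURVE(REFERENCE)
  /PRINT=COORDINATES.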


Figure 3.28 ROC Curve

The vertical axis records the sensitivity (true-positive rate) and the horizontal axis represents the
false-positive rate (1 – specificity). The diagonal line provides a reference noting the pattern if
both increased at the same rate, which would occur if the Test variable were unrelated to the State
variable.

The primary use of this plot is to view the tradeoffs between the sensitivity (true-positive) and
false-positive (1 – specificity) rates. You would be interested in areas of the plot in which the
curve rises rapidly, indicating a large increase in true-positives with little change in false-positives.
Also, if you have a desired true-positive rate, you can examine that section of the chart
to see how slightly changing that rate will influence the false-positive (1 – specificity) rate.

To better understand the basis of the ROC curve and link it to the classification table, we will
examine the “Coordinates of the Curve” table.

Scroll down to the Coordinates of the Curve pivot table until the Test Result value is
about .5


Figure 3.29 Coordinates of Curve Pivot Table (Middle of Table Shown)

If we focus on the row with Test Result value .5057878, which is the first test value above .5, we
see the sensitivity value is .339 and the “1 – specificity” value is .085. This first test value above
.5 is relevant to us since Binary Logistic Regression’s default cut point is at .5. The values
appearing in the “Coordinates” table at this point correspond to the true-positive and false-
positive values we calculated from Figure 3.26. Thus, the classification table contributes a single
point in the ROC curve, and the ROC curve extends the classification table by presenting results
under all possible cut points that could be applied to the predicted values from the logistic model.

ROC curves provide valuable information concerning the implications and possible advantages of
changing the cut point for classification in a logistic (discriminant, or some other scoring) model.

Appendix: Comparison to Discriminant Analysis


As a final analysis, we will use the data file on customers' future buying behavior from the
discriminant chapter and run the same model with logistic regression, predicting whether or not
customers say they will buy another VCR.

Click File…Recently Used Data … CSM.sav

After the file has opened, request a logistic regression.

Click Analyze...Regression...Binary Logistic


Move buyyes into the Dependent box
Move age, complain, educ, fail, pinnovat, preliabl, puse, qual, use, value into the
Covariates list box (not shown)
Click the Options button, then click the Classification plots check box

None of the independent variables are nominal in scale, so none need to be declared categorical
covariates.


Click Continue, then OK
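The equivalent syntax for this run is roughly the following sketch (the variable list comes from the steps above; SPSS may paste additional default subcommands):

LOGISTIC REGRESSION VARIABLES buyyes
  /METHOD=ENTER age complain educ fail pinnovat preliabl puse qual use value
  /CLASSPLOT.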

Scroll down in the Content Pane until you see the Block 1 Omnibus Tests of Model Coefficients
table. The model with the ten independent variables is significant, which is hardly surprising (a
chi–square of 221.207 with 10 df, significance less than .0005). The classification table shows
that 81.51% of the cases were correctly predicted, with far better predictions for customers who
are likely to buy again. This is the same pattern we found with discriminant analysis. Figure 2.15
in the discriminant chapter has the comparable table; there, 81.7% of the cases were correctly
classified, a remarkably similar result. In the discriminant analysis, 86.5% of those likely to buy
another VCR were predicted correctly, versus 92.23% in the logistic regression, but the logistic
predictions were worse for those not likely to buy again.

Figure 3.30 Classification Table and Test of Model

Which variables were found to be significant predictors of whether a customer would buy another
VCR? Only fail, pinnovat, use, and value have probability values below .05 for their B
coefficients. Compare these results to those from discriminant, such as the structure coefficients
in Figure 2.14 in the discriminant chapter. The variables with the highest coefficients were
pinnovat, use, complain, and value. This is similar to the logistic result, but not identical. And, of
course, we had a different ordering using the discriminant function coefficients.

Figure 3.31 Variables in the Equation Predicting Willingness to Buy another VCR


In the discriminant analysis, all the variables except educ acted to increase the likelihood of
buying another VCR. The same pattern is evident in the logistic results in the Exp(B) column: all
coefficients except that for educ are greater than 1, meaning that a higher score increases the odds
of being in the higher category of buyyes. The largest coefficient is for fail. The Exp(B)
coefficient for educ is less than 1, so increasing education lowers the odds that someone will buy another VCR.

So which analysis technique is to be preferred? On purely statistical grounds, since multivariate
normality cannot hold for the customer data, logistic regression might be preferred. It might also
be preferred because it results in more accurate predictions for future buyers.

Scroll down to the histogram of probabilities

Figure 3.32 Histogram of Predicted Probabilities

This is a typical histogram when the percentage of correct predictions is reasonably high. Most of
the “N” symbols occur to the left of .50, and most of the “L” symbols occur to the right of that
cutpoint value. Moreover, note how the probability values are not clustered around .50, but
instead cluster near the lower and upper ends of the distribution, especially for those likely to buy
another VCR.


Chapter 4
Multinomial Logistic Regression

Topics
• Introduction
• Multinomial Logistic Model
• Assumptions of Multinomial Logistic Regression
• A Multinomial Logistic Regression Analysis: Predicting Credit Risk
• Interpreting Coefficients
• Making Predictions
• Appendix: Multinomial Logistic with a Two-Category Outcome

Introduction
In the previous chapter we discussed binary logistic regression, in which the goal is to predict a
two-category outcome from a set of predictor variables, some of which are continuous. This
method has been extended in several ways and some of these extensions are available as
procedures within SPSS.

Multinomial or polytomous Logistic Regression, which can be done using the Multinomial
Logistic Regression procedure, allows the categorical dependent variable to contain more than
two categories. Thus it can be applied to such situations as:

• Predicting which brand (of the major brands) of personal computer an individual will
purchase
• Predicting voting behavior (vote for candidate A, vote for candidate B, vote for candidate
C, don’t vote)

Multinomial logistic regression will be covered in this chapter.

Ordered Logistic Regression allows a rank-ordered dependent variable to be analyzed. Some examples are:

• Predicting illness outcome (three categories: fully recover, partially recover, die)
• Predicting customer satisfaction (very satisfied, satisfied, not satisfied)

More information on ordered logistic regression can be found in the SPSS Advanced Models
manual.


Multinomial Logistic Model


First a brief review from the previous chapter. Binary logistic regression can be formulated as:

Prob(event) = e^(α + B1X1 + B2X2 + ... + BkXk) / (1 + e^(α + B1X1 + B2X2 + ... + BkXk))
Where X1, X2, …, Xk are the predictor variables.

This can also be expressed in terms of the odds of the event occurring.

Odds(event) = Prob(event) / (1 − Prob(event)) = Prob(event) / Prob(no event) = e^(α + B1X1 + B2X2 + ... + BkXk)

where the outcome is one of two categories (event, no event). If we take the natural log of the
odds, we have a linear model:

ln(Odds(event)) = α + B1X1 + B2X2 + ... + BkXk

With two outcome categories, a single odds ratio summarizes the outcome. However, when there
are more than two outcome categories, ratios of the category probabilities can still describe the
outcome, but additional ratios are required. For example, in the credit risk data used in this
chapter there are three outcome categories: good risk, bad risk– profit, and bad risk – loss.
Suppose we take the Good Risk category as the reference or baseline category and assign integer
codes to the outcome categories for identification: (1) Bad Risk – Profit, (2) Bad Risk – Loss, (3)
Good Risk. For the three categories we can create two probability ratios:

g(1) = π(1) / π(3) = Prob(Bad Risk - Profit) / Prob(Good Risk)

and

g(2) = π(2) / π(3) = Prob(Bad Risk - Loss) / Prob(Good Risk)

Where π (j) is the probability of being in outcome category j.

Each ratio is based on the probability of an outcome category divided by the probability of the
reference or baseline outcome category. The remaining probability ratio (Bad Risk- Profit / Bad
Risk – Loss) can be obtained by taking the ratio of the two ratios shown above. Thus the
information in J outcome categories can be summarized in (J-1) probability ratios.

In addition, these outcome-category probability ratios can be related to predictor variables in a
fashion similar to what we saw in the binary logistic model. Again using the Good Risk outcome
category as the reference or baseline category, we have the following model:


ln(π(1) / π(3)) = ln(Prob(Bad Risk - Profit) / Prob(Good Risk)) = α1 + B11X1 + B12X2 + ... + B1kXk

and

ln(π(2) / π(3)) = ln(Prob(Bad Risk - Loss) / Prob(Good Risk)) = α2 + B21X1 + B22X2 + ... + B2kXk

Notice that there are two sets of coefficients for the three-category outcome case, each describing
the ratio of an outcome category to the reference or baseline category. If we complete this logic
and create a ratio containing the baseline category in the numerator, we would have:

ln(π(3) / π(3)) = ln(Prob(Good Risk) / Prob(Good Risk)) = ln(1) = 0 = α3 + B31X1 + B32X2 + ... + B3kXk

This implies that the coefficients associated with ln(π(3)/π(3)) are all 0 and so are not of interest.
Also, the ratio relating any two outcome categories, excluding the baseline, can be easily obtained
by subtracting their respective natural log expressions. Thus:

ln(π(1) / π(2)) = ln(π(1) / π(3)) − ln(π(2) / π(3)), or

ln(Prob(Bad Risk - Profit) / Prob(Bad Risk - Loss)) = ln(Prob(Bad Risk - Profit) / Prob(Good Risk)) − ln(Prob(Bad Risk - Loss) / Prob(Good Risk))

We are interested in predicting the probability of each outcome category for specific values of the
predictor variables. This can be derived from the expressions above. The probability of being in
outcome category j is:

π(j) = g(j) / [g(1) + g(2) + ... + g(J)], where J is the number of outcome categories.

In our example with the three risk outcome categories, for outcome category (1):

g(1) / [g(1) + g(2) + g(3)] = [π(1)/π(3)] / [π(1)/π(3) + π(2)/π(3) + π(3)/π(3)] = π(1) / [π(1) + π(2) + π(3)] = π(1) / 1 = π(1)

And substituting for the g(j)’s, we have an equation relating the predictor variables to the
outcome category probabilities.


π(1) = e^(α1 + B11X1 + B12X2 + ... + B1kXk) / [e^(α1 + B11X1 + ... + B1kXk) + e^(α2 + B21X1 + B22X2 + ... + B2kXk) + e^(α3 + B31X1 + B32X2 + ... + B3kXk)]

     = e^(α1 + B11X1 + B12X2 + ... + B1kXk) / [e^(α1 + B11X1 + ... + B1kXk) + e^(α2 + B21X1 + B22X2 + ... + B2kXk) + 1]

In this way, the logic of binary logistic regression can be naturally extended to permit analysis of
dependent (outcome) variables with more than two categories. We will perform a multinomial
logistic regression analysis of a three-category outcome variable using credit risk data.

Assumptions of Multinomial Logistic Regression


The assumptions of multinomial logistic regression, with the exception of the outcome variable
following a multinomial (rather than a binomial) distribution, are identical to those of binary
logistic regression, which were discussed in Chapter 3.

A Multinomial Logistic Analysis: Predicting Credit Risk


We will perform a multinomial logistic analysis that attempts to predict credit risk (three
categories) for individuals based on several financial and demographic predictor variables. We
are interested in fitting a model, interpreting and assessing it, and obtaining a prediction equation.
Possible predictor variables are:

age Age in years
income Income (in thousands of British pounds)
gender Where f=female, m=male
marital Marital status: single, married, divsepwid (divorced, separated or widowed)
numkids Number of dependent children
numcards Number of credit cards
howpaid How often paid: weekly, monthly
mortgage Have a mortgage: y=yes, n=no
storecar Number of store credit cards held
loans Number of other loans
risk Credit risk: 1= bad risk- loss; 2=bad risk- profit; 3= good risk

To access the data:

Click File…Open…Data
Move to the c:\Train\AdvStat directory
Double-click on Risk.sav


Figure 4.1 Risk Data

We will not spend time exploring the variables here, but in practice you would be certain to do so
before performing any modeling. In our analysis, in order to keep the model simple, only some of
the predictors will be used. The model can be improved by adding additional predictors and
perhaps by including interaction terms. To run the multinomial logistic regression analysis:

Click Analyze…Regression…Multinomial Logistic

The minimum specification for the procedure is a dependent variable and one Factor (categorical)
or Covariate (scale or continuous) predictor. Unlike the Binary Logistic Procedure, there are
separate list boxes for categorical and continuous predictor variables. Also, note that interaction
effects are added to the model not from the main dialog but from the Model dialog box.

Move risk into the Dependent list box


Move marital and mortgage into the Factor(s) list box
Move income and numkids into the Covariate(s) list box


Figure 4.2 Completed Multinomial Logistic Regression Dialog

As noted above, a reference category is needed for the procedure. By default that category will be
the last one, and this is noted in parentheses after the variable name risk in the Dependent box.
You can change this by clicking on the Reference Category button and selecting another
category.

We could run the multinomial logistic regression analysis now, but will examine the subdialog
boxes and request some additional statistics (the classification table).

Click the Model button

The Model dialog box allows you to specify precisely the model you wish. By default, a model
including the main effects (no interactions) of factors and covariates will be run. This is similar to
what the Regression and Binary Logistic Regression procedures do (unless interaction terms are
formally added). The Full factorial option would fit a model including all factor interactions (in
our example, with two factors, the two-way interaction of marital and mortgage would be added).

Also, an intercept is included by default.


Figure 4.3 Multinomial Logistic Regression: Model Dialog

The Custom/Stepwise Terms option permits you to fully specify the model you wish to test, which
can include interactions involving factors and covariates. To do stepwise multinomial logistic
regression, place the appropriate terms in the Stepwise Terms list box.

Click the Cancel button


Click the Statistics button
Click the Classification table check box

By default, summary statistics and (partial) likelihood ratio tests for each effect in the model
appear in the output. Also, 95% confidence bands will be calculated for the parameter estimates.
We have requested a classification table so we can assess how well the model predicts cases into
the three risk categories.

In addition, a table of observed and expected cell probabilities can be requested. Note that, by
default, cells are defined by each unique combination of a covariate and factor pattern, and a
response category. Since a continuous predictor (income) is used in our analysis, the number of
cell patterns is very large and each might have but a single observation. Goodness of fit chi-
square statistics can be requested to evaluate the model. However, the subgroups on which these
are based are constructed from the combinations of covariate and factor patterns. In our analysis
we would have many cells with small counts, yielding unstable results. This can be controlled by
specifying that the subpopulations for the goodness of fit tests and cell probabilities be based on
the covariate patterns for a restricted set of predictors (see Define Subpopulation area at the
bottom of the dialog). In this analysis, we will forego goodness of fit statistics. Finally, the
asymptotic correlation of parameter estimates can provide a warning for multicollinearity
problems (when high correlations are found among parameter estimates).


Figure 4.4 Multinomial Logistic Regression: Statistics Dialog

Click Continue
Click the Options button

In logistic regression the expected variance of the dependent variable can be compared to the
observed variance, and discrepancies may be considered under- or over dispersion (lower or
higher variation, respectively, in the outcome than is expected by theory). If there is moderate
discrepancy, standard errors will be incorrect; in particular over-dispersion will cause the
standard errors to be over-optimistic (too narrow), and one should use adjusted standard errors,
which will make the confidence intervals wider. However, if there are large discrepancies, this
indicates a need to respecify the model, or that the sample was not random (e.g., a clustered
sample), or other serious design problems.

The Scale option allows adjustment to the estimated parameter variance-covariance matrix based
on over-dispersion. The details of such adjustment are beyond the scope of this course, but you
can find some discussion in McCullagh and Nelder (1989).


Figure 4.5 Multinomial Logistic Regression Options Dialog

The other choices in this dialog concern stepwise models, including the entry and removal
probability values, the type of test done for entry/removal, the hierarchy of terms, and the
minimum (set to 0 by default) and maximum (no limit by default) number of terms in a stepwise
model. These last choices might be used in very specific circumstances; for example, if you had
many potential predictors, you might decide to limit any stepwise model to only 5 terms for
parsimony.

Click Cancel
Click the Save button


Figure 4.6 Multinomial Logistic Regression: Save Dialog

The Multinomial Logistic Regression procedure, like Regression, Logistic Regression, and
Discriminant, can save new variables containing predicted values and other case statistics as new
variables, and model information in an XML (Extensible Markup Language) format. Thus other
applications, such as web-based applications which are capable of parsing XML files, could use
the model information to create predictions for new cases.

The Multinomial Logistic Regression procedure can save an estimated response probability for
each outcome category of the dependent variable, the predicted category (which is the outcome
category with the highest predicted probability), the estimated probability of a case being in the
predicted outcome category, and for comparison purposes, the observed probability of cases with
the same factor/covariate pattern belonging to the actual outcome category.

For this analysis, we will not save predicted outcomes and probabilities, but we could do so and
work with them in ways similar to what we did in Chapter 3.

Click the Cancel button

The Multinomial Logistic Regression Criteria dialog (not shown) controls technical convergence
criteria and would generally be used by experienced analysts if the initial analysis fails to
converge to a solution.

Click OK
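For reference, a syntax sketch roughly equivalent to the dialog selections above (the exact subcommands NOMREG pastes may differ; BASE=LAST matches the default reference category):

NOMREG risk (BASE=LAST) BY marital mortgage WITH income numkids
  /INTERCEPT=INCLUDE
  /PRINT=CLASSTABLE PARAMETER SUMMARY LRT.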

Multinomial Logistic first displays a case processing summary indicating how many cases are
used in the analysis—all the cases, or 2,455, in this instance.


Figure 4.7 Case Processing Summary

The marginal frequencies and percentages of the factors and dependent variable are reported,
along with a summary of the number of valid and missing cases. A case must have valid values
on all factors, covariates, and the dependent variable in order to be included in the analysis. The
footnote tells us that there are 2,418 distinct patterns (cells) in the analysis because covariates
with many distinct values are included in the model.

Figure 4.8 Model Fit and Pseudo R-Square Summaries

Next we view the Model Fitting Information table. We have seen these tables in the context of
binary logistic regression. The Final model chi-square statistic tests the null hypothesis that all
model coefficients are zero in the population, equivalent to the overall F test in regression. With
ten degrees of freedom that correspond to the parameters in the model (see below), it is based on
the change in –2LL (–2 log likelihood) from the initial model (with just the intercepts) to the final
model, and is highly significant. Thus at least some effect in the model is significant.

As discussed in Chapter 3, pseudo R-square measures try to measure the amount of variation (as
functions of the chi-square lack of fit) accounted for by the model. The model explains only a
modest amount of the variation (the maximum is 1, and some measures cannot reach this value).


Figure 4.9 (Partial) Likelihood Ratio Tests

The Model Fitting Information table in Figure 4.8 provides an omnibus test of effects in the
model. In the Likelihood Ratio Tests table in Figure 4.9 we have a test of significance for each
effect (in this case the main effect of a predictor variable) after adjusting for the other effects in
the model. The caption explains how it is calculated. All effects are highly significant. Notice that
the intercepts are not tested in this way, but tests of the individual intercepts can be found in the
Parameter Estimates pivot table.

Note also that the values in the df (degrees of freedom) column are double what you would
expect for a binary logistic regression model. For example, the covariate income, which is
continuous, has two degrees of freedom. This is because with three outcome categories, there are
two probability ratios to be fit, doubling the number of parameters. Income has by far the largest
chi-square value compared to the other predictors with two (or even four) degrees of freedom.

Interpreting Coefficients
The Parameter Estimates table contains the coefficient information for the parameters in the
model. The most striking feature of the table is that there are two sets of parameters. One set is
for the probability ratio of “bad risk – loss” to “good risk,” which is labeled “bad loss.” The other
set is for the probability ratio of “bad risk – profit” to “good risk,” labeled “bad profit.”


Figure 4.10 Parameter Estimates

The summary columns are identical to those we saw with binary logistic regression. For each of
the two outcome probability ratios, each predictor is listed, plus an intercept, with the estimated B
coefficients and their standard errors; a test of significance based on the Wald statistic; and the
Exp(B) column, which is the exponentiated value of the estimated B coefficient, along with its
95% confidence interval. As with ordinary and logistic regression, these coefficients are
interpreted as estimates for the effect of a particular variable, controlling for the other variables in
the equation.

Recall that the original (linear) model is in terms of the natural log of a probability ratio. The
intercept represents the log of the expected probability ratio of two outcome categories when all
covariates are zero and all factor variables are set to their reference category (last group) values.
For covariates, the B coefficient is the effect of a one-unit change in the independent variable on
the log of the probability ratio. Examining income in the “bad loss” section, a one-unit income
increase (equivalent to 1,000 British pounds) changes the log of the probability ratio between “bad
loss” and “good risk” by –.056 (–5.63E-02), that is, it lowers it. But what does this mean in terms of
probabilities? Moving to the Exp(B) column, we see the value is .945 for income (in the “bad loss”
section of the table). Thus increasing income by 1 unit (or 1,000 British pounds) decreases the
expected ratio of the probability of being a bad loss to the probability of being a good risk. In other
words, increasing income reduces the expected probability of being a “bad loss” relative to being a
“good risk,” and the ratio is multiplied by .945 for each one-unit (1,000 British pound) increase in income. If we
examine the income coefficient in the “bad profit” section of the table, we see that in a similar
way (Exp(B) = .878) the expected probability of being a “bad profit” relative to being a good risk
decreases as income increases. Thus increasing income, after controlling for the other variables in
the equation, is associated with decreasing the probability of having a “bad loss” or “bad profit”
outcome relative to being a “good risk.” This relationship is quantified by the values in the
Exp(B) column and the Sig(nificance) column indicates that both coefficients are statistically
significant.

Turning to the number of children (Numkids), we see that its coefficient is significant for the
“bad loss” ratio, but not the “bad profit” ratio. Examining the Exp(B) column for numkids in the

“bad loss” section, the coefficient estimate is 2.267. For each additional child (one unit increase
in numkids), the expected ratio of the probability of being a “bad loss” to being a “good risk”
more than doubles. Thus, controlling for other predictors, each additional child (one unit increase)
doubles the expected probability of being a “bad loss” relative to a “good risk”. However,
controlling for the other predictors, the number of children has no significant effect on the
probability ratio of being a “bad profit” relative to a “good risk.”

The Multinomial Logistic Regression procedure uses a General Linear Model coding scheme.
Thus for each categorical predictor (here marital and mortgage), the last category value is made
the reference category and the other coefficients for that predictor are interpreted as offsets from
the reference category. In examining the table we see that the last categories for marital (single)
and mortgage (y) have B coefficients fixed at 0. Because of this the coefficient of any other
category can be interpreted as the change associated with shifting from the reference category to
the category of interest, controlling for the other predictors. Since the reference category
coefficients are fixed at 0, they have no associated statistical tests or confidence bands.

Looking at the marital variable, its two coefficients (for divsepwid and married categories) are
significant for both the “bad loss” and “bad profit” summaries. In the “bad loss” section, we see
the estimated Exp(B) coefficient for the “Marital=divsepwid” category is .284, while that for
“Marital=married” = 2.891. Thus we could say that, after controlling for other predictors, shifting
from the single group to the divorced, separated or widowed group results in about a 2/3
reduction (from 1 to .284) in the expected ratio of the probability of being a “bad loss” relative to
a “good risk.” Thus the divorced, separated or widowed group is expected to have fewer “bad
losses” relative to “good risks” than the single group. On the other hand, the married group is
expected to have a much higher (by a factor of almost 3) proportion of “bad losses” relative to
“good risks” than the single group. The explanation of why being married versus single should be
associated with an increase of “bad losses” relative to “good risks” should be worked out by the
analyst, in conjunction with someone familiar with the credit industry (domain expert). If we
examine the marital Exp(B) coefficients for the “bad profit” ratios, we find a very similar result.

Finally, mortgage is significant for both the “bad loss” and “bad profit” ratios. Since having a
mortgage (coded y) is the reference category, examining the Exp(B) coefficients shows that
compared to the group with a mortgage, those without a mortgage have a greater expected
probability of being “bad losses” (1.828) or “bad profits” (2.526) relative to “good risks.” In
short, those without mortgages are less likely to be good risks, controlling for the other predictors.

In this way, the statistical significance of predictors can be determined and the coefficients
interpreted. Note that if a predictor were not significant in the Likelihood Ratio Tests table, then
the model should be rerun after dropping the variable. Although numkids is not significant for
both sets of category ratios (it is significant only for the “bad loss” ratio), the joint test (Likelihood Ratio Test) indicates it is significant and so
we would retain it.

Classification Table
The classification table, sometimes called a misclassification table or confusion matrix, provides
a measure of how well the model performs. With three outcome categories we are interested in
the overall accuracy of model classification, the accuracy for each of the individual outcome
categories, and patterns in the errors.


Figure 4.11 Classification Table

The rows of the table represent the actual outcome categories while the columns are the predicted
outcome categories. We see that overall, the predictive accuracy of the model is 62.4%. Although
marginal counts do not appear in the table, by adding the counts within each row we find that the
most common outcome category is bad profit (1,475 from Figure 4.7). This constitutes 60.1
percent of all cases. Thus the overall predictive accuracy of our model is not much of an
improvement over the simple rule of always predicting “bad risk – profit.” However, we should
recall that this simple rule would never make a prediction of “bad risk – loss” or “good risk.”

In examining the individual outcome categories, the “bad risk – profit” group is predicted most
accurately (87.3%), while the other categories, “bad risk – loss” (15.9%) and “good risk” (36.8%)
are predicted with much less accuracy. Not surprisingly, most errors in prediction for these latter
two outcome categories are predicted to be “bad risk– profit.”

The classification table thus allows us to evaluate a model from the perspective of predictive
accuracy. Whether this model would be adequate depends in part on the value of correct
predictions and the cost of errors. Given the modest improvement of this model over simply
classifying all cases as “bad risk – profit,” in practice, an analyst would see if the model could be
improved by adding additional predictors and perhaps some interaction terms.

Finally, it is important to note that the predictions were evaluated on the same data used to fit the
model and for this reason may be optimistic. A better procedure is to keep a separate validation
sample on which to evaluate the predictive accuracy of the model.

Making Predictions
We now have the estimated model coefficients. How would we go about creating our own
predictions from the model? To illustrate let’s take an individual who is single, has a mortgage,
no children, and has an income of 35,000 British pounds. What is the predicted probability of her
(although gender was not included in the model) being in each of the three risk categories? Into
which risk category would the model place her?

Earlier in this chapter we showed the following (where π (j) is the probability of being in
outcome category j):

π(j) = g(j) / ∑ i=1..J g(i),   where J is the number of outcome categories


If we substitute the parameter estimates in order to obtain the estimated probability ratios, we
have:

ĝ(1) = e^(.438 − .056*Income + .818*Numkids − 1.260*Marital1 + 1.062*Marital2 + .603*Mortgage1)

ĝ(2) = e^(4.285 − .130*Income + .153*Numkids − 1.220*Marital1 + 1.021*Marital2 + .927*Mortgage1)

and

ĝ(3) = 1

Where because of the coding scheme for the categorical predictors (Factors):

Marital1 = 1 if Marital=divsepwid; 0 otherwise


Marital2 = 1 if Marital=married; 0 otherwise
Mortgage1 = 1 if Mortgage=n; 0 otherwise

Thus for our hypothetical individual, the estimated probability ratios are:

ĝ(1) = e^(.438 − .056*35.0 + .818*0 − 1.260*0 + 1.062*0 + .603*0) = e^(−1.522) = .218

ĝ(2) = e^(4.285 − .130*35.0 + .153*0 − 1.220*0 + 1.021*0 + .927*0) = e^(−.265) = .767

ĝ(3) = 1

And the estimated probabilities are:

π̂(1) = .218 / (.218 + .767 + 1) = .110

π̂(2) = .767 / (.218 + .767 + 1) = .386

π̂(3) = 1 / (.218 + .767 + 1) = .504

Since the third group (good risk) has the greatest expected probability (.504), the model predicts
that the individual belongs to that group. The next most likely group to which the individual
would be assigned would be group 2 (bad risk – profit) because its expected probability is the
next largest (.386).

In this way, once the model is established, predicted probabilities and outcomes can be obtained
with relatively simple equations.
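
As a check on the hand calculation, here is a minimal Python sketch that plugs the same coefficient estimates into the equations above. Income appears to be entered in thousands of pounds (hence 35.0 for £35,000); the variable names are just labels for this sketch.

```python
from math import exp

income, numkids = 35.0, 0                  # income in thousands of pounds, no children
marital1, marital2, mortgage1 = 0, 0, 0    # single (neither marital dummy), has a mortgage

# Baseline-category logits; category 3 (good risk) is the reference, so g(3) = 1
g1 = exp(0.438 - 0.056*income + 0.818*numkids - 1.260*marital1 + 1.062*marital2 + 0.603*mortgage1)
g2 = exp(4.285 - 0.130*income + 0.153*numkids - 1.220*marital1 + 1.021*marital2 + 0.927*mortgage1)
g3 = 1.0

total = g1 + g2 + g3
probs = {"bad loss": g1/total, "bad profit": g2/total, "good risk": g3/total}
print({k: round(v, 3) for k, v in probs.items()})   # {'bad loss': 0.11, 'bad profit': 0.386, 'good risk': 0.504}
print(max(probs, key=probs.get))                    # 'good risk' -- the predicted category
```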


Extensions
In this chapter we examined only main effects within multinomial logistic regression. Interaction
effects can be incorporated into models by specifying them within the Multinomial Logistic
Regression Model subdialog box.

Appendix: Multinomial Logistic With a Two-Category Outcome
In Chapter 3 we used the Binary Logistic Regression procedure when predicting the two-category
outcome (low birth weight, normal birth weight) for newborns. Since the multinomial analysis is
an extension of the binary logistic analysis, we would expect that running a binary logistic
analysis using the multinomial procedure would produce the same results. In this appendix we
will demonstrate this by analyzing the birth weight data from Chapter 3 using Multinomial
Logistic Regression.

Click File…Open…Data
Double-click on Birthwt.sav

Figure 4.12 Birth Weight Data

The dependent variable is low (low birth weight: 0= normal, 1= low) and we will use the three
predictors (ht: history of hypertension, lwt: mother’s weight at last menstrual period, and ptl:
history of premature labor) that were selected using stepwise logistic regression (see Chapter 3).

Click Analyze…Regression…Multinomial Logistic


Move low into the Dependent list box
Move ht, lwt and ptl into the Covariate(s) list box


Figure 4.13 Multinomial Logistic Regression Dialog

Since ht (history of hypertension) is coded 0 (no history) or 1 (history), it could also have been
specified as a Factor. Here we add it as a covariate to more closely match the binary logistic
regression setup.

Click the Statistics button


Click the Classification table check box (not shown)
Click Continue
Click OK

To make the comparison easier with Binary Logistic we first reproduce the Variables in the
Equation table from the binary logistic analysis (Figure 3.24).

Figure 4.14 Parameter Estimates for Birth Weight Analysis Using the Binary Logistic
Regression Procedure

Next we display the equivalent table from Multinomial Logistic.


Figure 4.15 Parameter Estimates for Birth Weight Using Multinomial Logistic Regression

Except for a reversal of the signs of the coefficients, the parameter estimates and supporting
statistics for binary (Step 3 results in Figure 4.14) and multinomial logistic are the same. The
reason for the coefficient sign reversal is that the Binary Logistic procedure places the probability
of the last outcome category (here 1; low birth weight) in the numerator of the odds ratio. The
Multinomial Logistic procedure uses the last outcome category as the reference or baseline
category, which appears in the denominator of the probability ratio. So the probability ratio of
Binary Logistic is the inverse of that used by Multinomial Logistic. We could eliminate the sign
change by reversing the (0,1) coding of low in one or the other analysis. Also as a consequence of
coding, the Exp(B) values in one table are the inverses of those in the other.
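
A one-line check makes the relationship explicit. The value 1.855 below is just an illustrative coefficient, not one of the estimates in Figures 4.14 or 4.15; the point is only that reversing which outcome category sits in the numerator flips the sign of B and inverts Exp(B).

```python
from math import exp

b = 1.855                      # illustrative B from a model with category 1 in the numerator
print(exp(b), exp(-b))         # the two procedures' Exp(B) values: reciprocals of one another
print(exp(b) * exp(-b))        # equals 1 (up to floating-point rounding)
```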

Next, we examine the classification table results. The classification table from binary logistic
regression is reproduced from Figure 3.23.

Figure 4.16 Classification Table from Binary Logistic Regression

Then, look at the table from Multinomial Logistic.

Figure 4.17 Classification Table from Multinomial Logistic Regression


As expected, the model classification table results for the final binary logistic regression model
(Step 3 results in Figure 4.16) match those from the multinomial logistic regression analysis.

Although both procedures yield the same results here, keep in mind that they offer slightly
different options, especially concerning stepwise methods.


Chapter 5
Survival Analysis

Topics
• What is Survival Analysis?
• Concepts
• Censoring
• What to Look for in Survival Analysis
• Survival Procedures in SPSS
• Example: Kaplan-Meier
• Cox Regression
• Example: Cox Regression
• Checking the Proportional Hazards Assumption
• Appendix: Brief Example of Cox Regression with a Time-Varying Covariate

Introduction
Survival Analysis studies the length of time to a critical event. The analysis can involve a single
group, or can investigate survival time as a function of categorical or interval scale predictor
variables. It is often used in medical research to study the amount of time patients survive
following onset of a disease, or time spent in remission. Other applications include time to failure
for electrical or mechanical components, the time employees spend with a company, time to
complete a complex transaction, such as a loan application or a house purchase, and time to
promotion within an organization.

In this chapter we will describe common survival analysis applications and review the main
concepts. SPSS contains three procedures that perform survival analysis; we will discuss their
differences and suggest when to use each. We provide a detailed example of one of these
procedures by running a Kaplan-Meier analysis comparing survival time for treatment and control
groups in a medical study of chronic hepatitis. In addition, we will introduce Cox Regression by
modeling time spent in an addiction treatment program as a function of several client
characteristics. An extension of this analysis (Cox Regression with a time-varying covariate) is
introduced in the appendix to this chapter.


What is Survival Analysis?


As mentioned in the introduction, survival analysis examines the length of time to a critical event,
possibly as a function of predictor variables. If the critical event is death, then life insurers have a
great interest in survival time for different populations. Similarly, medical researchers
are keenly interested in survival following onset and treatment of specific diseases. In
engineering, the time to failure of a component is of considerable interest. A researcher studying
employee satisfaction may be interested in the time it takes different groups to be promoted, or the
time employees stay at a company. Some additional examples appear below:

• Time in remission following chemotherapy for a particular cancer when patients may
outlive the study
• Compare time to complete a process (loan application, house purchase, and military
induction) in different locations or using different procedures
• Renters of commercial space are interested in predicting the length of time a tenant will
remain in the space, and in how long a space will stay empty
• Time to failure of an expensive component when the component may be replaced before
failing as part of regular maintenance
• Time it takes to complete a requirement (for example, a Ph.D. at a university)
• Lengths of time different groups of customers maintain checking accounts at a bank.

Since time has interval scale properties (actually ratio scale), such methods as regression or
ANOVA might first come to mind when considering these examples. However, survival data files
often contain censored values: observations for which the final outcome is not known, yet some
information is available. For example, ordinary regression has no easy way of handling a data
value of 15+ years, meaning we know the patient has survived at least 15 years so far, with an
unknown future survival. Survival analysis can explicitly incorporate such information.

Within the general area of survival analysis there are levels of complexity. One might simply
desire an actuarial table for a single group. There might be several groups (based on
demographics, or different treatment groups in medical studies) that you wish to compare in
terms of survival time and test for significant differences. You might have background
information (interval or categorical) on individuals and wish to build a prediction model for
survival time. Each of these applications is available using one of the survival procedures within
SPSS.

Concepts
The main outcome measure in survival analysis is the length of time to a critical event. Summaries
such as the mean and median survival time, along with their standard errors, are
useful. The mean survival time is not simply the sample mean value, but is estimated
using the cumulative survival plot (discussed below) that adjusts for censored data.

An important summary is the survival function over time. The cumulative survival plot (or table)
displays the estimated probability of surviving longer than the time displayed in the chart (or
table). An example of a cumulative survival plot for two treatment groups in a medical study
appears below.


Figure 5.1 Survival Plot for Two Treatment Groups

We see the probability of surviving longer than a given time starts out at 1.0 (when time= 0) and
decreases over time. The probability is adjusted at the time of each critical event (here a death).
There were censored observations (patients who could not be contacted beyond a certain point,
died of other causes, or outlived the study) that appear as small plus signs (+) along the plot. They
are used in the denominator when calculating survival probability up until the time of their
censoring (since they are known to be alive up until that point), and discarded thereafter.

Note that the plot is an empirical curve, adjusted for each critical event. This approach does not
make distribution assumptions about the shape of the curve and is called nonparametric. This is
the approach taken by the Kaplan-Meier procedure in SPSS; there are parametric models of
survival analysis that would fit smooth curves to the data viewed above. For this reason, when
comparing survival functions of different groups using the Kaplan-Meier procedure,
nonparametric tests are used.

A related measure is called the hazard function. It measures the rate of occurrence per unit time of
the critical event, at a given instant. As such it represents the risk of the event occurring at a
particular time. A medical researcher would be interested in looking at time points with elevated
hazard rates, as would a renter of retail space, since these would be times of greater risk of the
event (death or tenant leaving, respectively). SPSS can produce cumulative hazard plots. The
survival function can be related to the cumulative hazard function.

For more detailed discussion of these issues see Lee (1992), Kleinbaum (1996), or Klein and
Moeschberger (1997).

Censoring
A distinguishing feature of survival analysis is its use of censored data. Users of other advanced
statistical methods are familiar with missing data, which are observations for which no valid
value is recorded. Censored values contain some information, but the final value is hidden or not
yet known. For example, in a medical study follow-up, after 65 months a patient may have moved
and thus be unreachable, or perhaps at the end of the data collection phase for the study, the
patient is still alive, outliving the study. The researcher would thus know that the patient survived
the illness for at least 65 months, but does not know the final value. Some examples of censoring
are:
• Data are collected for length of time renting retail space, and at the end of data collection
some tenants are still renting.
• Medical studies: at the end of the study some patients are still alive or in remission, some
move or refuse to participate, some die of other causes (for example accidents).
• Studying time to promotion: some employees left the company before being promoted.

Censored data is included in the calculation of the survival function. A censored case is used in
the denominator to calculate survival probabilities for data involving events occurring with
shorter time values than the censored case. As an example, if we know a patient survives 60
months and is then censored, use is made of the fact that the patient did not die during the first 60
months. After the time of censoring, the censored value is dropped from any survival
calculations. Returning to our example, we don’t know how much beyond 60 months the patient
survived, so he is not used in calculating the survival function beyond that point. In this way
survival analysis makes use of censored data. In both survival tables and plots, censored events
are noted.

What to Look for in Survival Analysis


When working with a single group of data there is interest in examining the cumulative survival
curve that shows the estimated probability of surviving beyond the end of each time period. In
addition you want the estimated means or medians of survival time, along with their standard
errors.

When several groups are involved, the cumulative survival plot can display each, allowing
comparisons of the individual curves. Also, the survival time means and medians can be
compared, and significance tests can be performed to test for group differences in survival
distributions.

Finally, if you have data containing categorical and interval scale variables you believe may be
predictive of survival, then Cox Regression can be used. It can perform significance tests on
individual predictors, perform stepwise analyses, display estimated coefficients for a prediction
equation, and produce cumulative survival and hazard plots for different subgroups.

For example, suppose a company manages retail space in two mall locations near a city and is
interested in estimating the length of time tenants stay. Rental records are available so it is known
how long previous tenants stayed, along with some information about their business (type of
shop, whether part of a national chain, size of space, rental price). For current tenants, we know
how long they have stayed to date, but do not know how long they will actually stay (censored
data). The company wants to compare the two malls in terms of length of stay and hopes to build
a predictive model based on characteristics of the businesses. Since some data are censored,
survival analysis should be used to answer these questions.


Survival Procedures in SPSS


SPSS contains three procedures that perform survival analysis. You would generally choose
among them based on the granularity or precision with which time to the critical event is
recorded, whether there are group factors of interest, and whether interval scale predictor
variables are to be used.

Below we show the choices under the Survival menu.

Figure 5.2 SPSS Survival Procedures

The Life Tables procedure is appropriate when the time to critical event measure is recorded in
broad ranges (for example in six-month periods, or whole years) so that there are many ties
among the data values, or if there is no interest in differentiating between small time differences.
It collapses the time measure into ordered categories, each based on a range of time values, and
produces summaries based on the data aggregated at this level. It can perform significance tests
comparing groups on a single factor. It will also produce actuarial tables based on the time
ranges.

The Kaplan-Meier procedure is appropriate when the time to critical event measure is precise
enough so there are relatively few ties in the data. Examples might be number of months
surviving, or the fractional number of years a retail space is occupied by a tenant. Besides
producing survival tables and plots, Kaplan-Meier can perform significance tests comparing
groups differing on a single factor. In addition, a second grouping variable (Strata variable) can
be specified. Tests comparing the factor levels can be performed separately for each strata level.

Cox Regression is also called a proportional hazard model. It posits that the hazard rate can be a
function of both categorical and interval scale predictor variables. Thus it is more general than the
previously discussed survival routines. It does assume that the hazard functions for different
groups are proportional to each other over time. This assumption can be examined and a variant
of Cox regression (Cox Regression with time varying covariates) can be applied when the
assumption doesn’t hold.

Since much data collected contains fairly precise measures of the time to critical event with
relatively few ties, of the three procedures, Kaplan-Meier and Cox regression are used most in
practice.

An Example: Kaplan-Meier
Hand et al. (1994) present data from a study by Kirk et al. (1980) in which survival time for
patients with chronic active hepatitis was compared for treatment (prednisolone therapy) and
control groups. We are interested in the survival functions and in testing whether treatment makes
a significant difference in the survival function.


Click File…Open…Data
Move to the c:\Train\Advstat directory if necessary
Double-click on KM.sav
Click the Value Labels tool if not already depressed

Figure 5.3 Survival Data (Chronic Active Hepatitis)

Labels are displayed in the Data Editor. The group variable divides the data into Treatment (1)
and Control (2) patients. The time variable records time to the critical event (death) or time when
censoring occurred; most of the censoring was due to patients outliving the study. The status
variable indicates whether the critical event occurred (1= death) or that the case was censored (2=
censored). Since time is measured in months, and judging from the cases visible in the Data
Editor window we expect few ties, we will proceed directly to the Kaplan-Meier procedure.

Click Analyze…Survival…Kaplan-Meier
Move the time variable into the Time list box
Move the group variable into the Factor list box
Move the status variable into the Status list box

We indicate that time measures the time to the critical event, and that status defines the status of
an observation (whether the critical event—here death—occurred). Once a variable is placed in
the Status list box a question mark in parenthesis appears beside it and the Define Event
pushbutton becomes active. We must now provide the status data value(s) that indicates the event
occurred. For our data, a status value of 1 means the patient died.

Click the Define Event button (note the Status variable must be selected)
Type 1 in the Single Value text box
Click Continue

In most instances there is a single code indicating that the critical event occurred. However, the
Kaplan-Meier procedure allows for multiple values to signal the event, and you can specify a
range or list of such values. For example, in a medical study of cancer remission the critical event
may be the discovery of a new tumor. Here different event codes may be recorded for different
sized tumors, although any of these codes mark the event.

Figure 5.4 Kaplan-Meier Dialog Box with Event Defined

The Strata box permits you to specify a categorical variable within which the factor variable will
be stratified (nested). This allows the survival analysis to be performed separately for each strata
group. For example, if the study recorded gender and we wished to examine treatment versus
control differences separately for men and for women, we would specify gender as a strata
variable. It is important to note that significance tests are not performed on strata differences, but
only on the factor variable either nested within or pooled across strata. If we wish to explicitly
test for the survival effects of more than a single factor, we must turn to Cox regression.

By default the survival tables produced by Kaplan-Meier will use the Time variable to label the
observation. If you supply a Label Cases by: variable then SPSS will use its value labels (the first
20 characters) to label the case in survival tables.

Since we wish to test for group differences in survival, we click the Compare Factor button.

Click the Compare Factor button


Click Log Rank, Breslow and Tarone-Ware for Test Statistics
Click Continue

Figure 5.5 Requesting Significance Tests


We’ve requested all three available tests of differences in survival distributions across factor
groups. Each test statistic is based on a comparison of the number of critical events (here death)
observed to the expected number of events at each event time point. The expected number of
events is derived from the number of cases at risk and the number for which the critical event
occurred at an event time point. If there are no factor differences then the expected numbers of
events (derived from data pooled across factor levels) should closely fit the observed numbers for
the different factor groups. The tests differ in how each event is weighted when calculating the
overall statistic. The log rank test weights each event the same whether it occurs early or late
along the time scale. The Breslow test weights each event by the number at risk. Since the
number at risk decreases over time, early events are weighted more heavily than later events in
the Breslow statistic. The Tarone-Ware test weights the events by the square root of the number at
risk, and so weights early events less than the Breslow test, but more than the log rank test. What
is the implication of all this? One issue is that if differences showing up early are important to
you, the Breslow test is the most sensitive to them.
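
The sketch below, written under the assumption of two groups and using made-up data (not the hepatitis study values), shows one common textbook form of these tests; the only thing that changes across the three statistics is the weight given to each event time. SPSS computes the same family of statistics, handling ties and other details internally.

```python
from math import sqrt

def weighted_logrank(times, events, groups, weight="log rank"):
    """Two-group test of equal survival distributions.
    times: follow-up times; events: 1 = event, 0 = censored; groups: 0/1 group codes."""
    event_times = sorted({t for t, e in zip(times, events) if e == 1})
    U = V = 0.0
    for t in event_times:
        n  = sum(1 for ti in times if ti >= t)                       # at risk, both groups
        n1 = sum(1 for ti, gi in zip(times, groups) if ti >= t and gi == 1)
        d  = sum(1 for ti, ei in zip(times, events) if ti == t and ei == 1)
        d1 = sum(1 for ti, ei, gi in zip(times, events, groups)
                 if ti == t and ei == 1 and gi == 1)
        w = {"log rank": 1.0, "breslow": float(n), "tarone-ware": sqrt(n)}[weight]
        U += w * (d1 - d * n1 / n)                                   # observed minus expected
        if n > 1:
            V += w * w * d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
    return U * U / V            # compare to a chi-square distribution with 1 df

times  = [2, 3, 4, 4, 6, 7, 9, 11, 12, 15]
events = [1, 1, 0, 1, 1, 0, 1, 1, 0, 1]
groups = [0, 0, 0, 1, 0, 1, 1, 1, 1, 1]
for w in ("log rank", "breslow", "tarone-ware"):
    print(w, round(weighted_logrank(times, events, groups, w), 3))
```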

If the levels of the factor variable make up an interval scale (for example they represent drug
dosage levels or price groups for retail space) you can also request a significance test for a linear
trend across the factor levels related to survival. The remaining pushbuttons permit control over
how you want tests involving a Strata variable to be done (tests of factor differences pooled
across Strata levels, or performed separately within each Strata level). Also, you can choose
whether you want pairwise tests between all pairs of factor levels (when there are more than two
factor levels) performed, and if so should these tests be performed pooling across Strata levels, or
done separately within each. Since we have no Strata variable and only two levels for the factor,
these options are not relevant.

Click the Options button


Click the Survival and Hazard check boxes in Plot area
Click Continue

Figure 5.6 Requesting Survival Plots

By default, Kaplan-Meier will print the survival table and estimated survival means and medians
for each factor level group. Here we request that cumulative survival and hazard plots (discussed
in the Concepts section) be displayed. Usually, one or the other is requested, but we will view
both. You can also request a one minus survival plot. It displays 1 - Survival, an increasing
function over time, which is a form that some researchers prefer. Another available plot is the log
survival plot; it displays the cumulative survival function in a log scale. This stretches the vertical
axis from a 0 to 1 scale into a minus infinity to 0 scale.

Click the Save button

Figure 5.7 Save Options

SPSS can save various results of the analysis as new variables. The estimated cumulative survival
probability can be saved with its standard error, as can the cumulative hazard function value.
These values are not commonly saved for later analyses.

Click Cancel
Click OK

Results
We first see a Case Processing Summary table, which is useful in this instance because it tells us
how many cases are in each factor group, and how many are censored in each group. There are 22
cases in each group, and there are more censored cases (50%) in the treatment group. If the
treatment was effective at delaying death, then we would expect more censored cases (unless the
study period was long enough to follow every subject until death).

Figure 5.8 Case Summary

Next is the Survival Table, displaying detailed statistics for each group. For display purposes, we
have activated the table and moved the groups to the layer.


Figure 5.9 Survival Table for Control Group

The data for the control group are ordered by survival time (in months). If we had supplied a
Label variable in the Kaplan-Meier dialog box, its values would appear in the first column. The
Status column indicates whether the event occurred or the case was censored. The next two
columns contain the estimates for the cumulative survival function and standard errors for all
cases for which an event occurred. We see from the bottom of the table that there were twenty-
two control cases. Thus for the first event which occurred at time 2 (2 months), the cumulative
survival value is (1 – (1/22)), or .955. The estimated probability of surviving beyond two months
in this group is 95.5%. At this point one critical event has occurred (N of Cumulative Events
column) and 21 observations remain with greater time values (N of Remaining Cases column).
For the second event, the cumulative survival value is equal to the probability of surviving
beyond the first event multiplied by the probability of surviving beyond the second event.
Therefore, the cumulative survival value is (1 – (1/22)) * (1 – (1/21)), or .909 (estimated probability
for a control group member surviving beyond 3 months is 90.9%). Note that the denominator
value for these calculations includes the 6 censored cases, since their survival times exceed the
values examined so far (2, 3). Note also that survival function values are not computed for
censored events since the true time to critical event is not known.
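
The same product-limit logic is easy to express directly. The sketch below is a minimal, illustrative implementation (it is not the SPSS algorithm, and the later times in the toy data are invented); it reproduces the first two control-group values worked out above.

```python
def kaplan_meier(times, died):
    """Return [(event time, cumulative survival)]; died: 1 = death, 0 = censored."""
    surv, results = 1.0, []
    n_at_risk = len(times)
    for t in sorted(set(times)):
        deaths  = sum(1 for ti, di in zip(times, died) if ti == t and di == 1)
        leaving = sum(1 for ti in times if ti == t)     # deaths plus censored cases at this time
        if deaths:
            surv *= 1 - deaths / n_at_risk
            results.append((t, round(surv, 3)))
        n_at_risk -= leaving                            # everyone at this time leaves the risk set
    return results

# 22 control cases: deaths at 2 and 3 months as in the text; the remaining times are invented
times = [2, 3] + [40] * 14 + [60] * 6
died  = [1, 1] + [1] * 14 + [0] * 6
print(kaplan_meier(times, died)[:2])    # [(2, 0.955), (3, 0.909)], matching the survival table
```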


Figure 5.10 Survival Table for Treatment Group

Figure 5.10 shows the same information for the treatment group. If we focus attention on the fifth
row in the table (time=56 months) we see a cumulative survival value is not calculated because
the case is censored (the patient survived). However, notice that the value in the “Number
Remaining” column decreases by one for this case. Thus the censored case is excluded from any
survival calculations for cases with greater values for time to critical event. However, it was used
in the denominator of the survival function calculations for cases with lesser values for time to
critical event. In this way, survival analysis makes use of the censored data information. Note that
50% of the cases in this group were censored, mostly at large time values. This is similar to the
pattern in the control group and probably due to patients outliving the end of the study.

The next portion of output is the Mean and Medians for Survival Time table, which displays the
estimated mean and median survival time for both groups, along with their standard errors. Note
that the mean and median are quite disparate for both groups; this is not surprising since time
cannot be less than 0, yet can take on large values (positive skew). Recall that the mean is not the
regular arithmetic mean of the sample, but is estimated from the survival curve (calculation is
based on the area under the curve). Since the observation with the greatest time value was
censored, a note appears to that effect. Finally, 95% confidence bands are displayed for the
estimated mean and median. The estimates are not especially precise, which is not surprising given the rather small sample.
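
Because the mean is taken from the curve, it can be reproduced in simplified form by summing the area of each flat segment of the survival step function, as in the sketch below. The step values and follow-up limit are illustrative, not the estimates in Figure 5.11.

```python
# Restricted mean survival time = area under the Kaplan-Meier step function.
steps = [(0, 1.0), (2, 0.955), (3, 0.909), (24, 0.80), (60, 0.55)]   # (time, S(t)) after each drop
end_of_follow_up = 96                                                # largest observed time
mean_survival = sum(s * (t_next - t)                                 # survival is flat between drops
                    for (t, s), (t_next, _) in zip(steps, steps[1:] + [(end_of_follow_up, 0.0)]))
print(round(mean_survival, 1))   # months
```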


Figure 5.11 Central Tendency Measures for Groups

Considering our interest in group differences, both the mean and median survival times are
considerably higher for the treatment group. We now examine the tests for group differences
(differences in the populations they represent) in survival functions.

Figure 5.12 Tests of Equality of Survival Distributions

We draw identical conclusions from each of the tests; testing at the .05 significance level, the
distribution of survival times is different for the two populations, and survival time is greater for
the treatment group (from Figure 5.11). The survival and hazard plots we view next should
confirm this interpretation.

We see a more rapid drop-off in the cumulative survival function for the control group, indicating
that survival time for this group is shorter than for the treatment group. Notice the censored
observations are concentrated at the higher time values (the observations with plus signs). This is
expected if the study does not extend for a long period of time relative to the mean survival value.


Figure 5.13 Cumulative Survival Function Plot

The cumulative hazard plot tells much the same story as the cumulative survival plot (the
cumulative hazard is the negative of the natural log of the cumulative survival function). It indicates
the risk of death increases more rapidly over time for the control group. Again, the conclusion is
that the treatment increases survival time.

Figure 5.14 Cumulative Hazard Function Plot

It might seem odd that the hazard value can be greater than 1, but the hazard is not a probability.
Instead, it is a rate: the conditional probability of death per unit of time. It reflects the
risk that a subject will die at a particular instant, and it can take on any value
from 0 to infinity. Higher values, as you would expect, indicate a greater instantaneous hazard of
death occurring.
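
A quick numerical check of the relationship mentioned above, H(t) = −ln(S(t)), also shows why the cumulative hazard can exceed 1 even though survival probabilities cannot; the survival values below are arbitrary illustrative points, not estimates from this study.

```python
from math import log

for s in (0.95, 0.8, 0.5, 0.2, 0.1):               # illustrative cumulative survival values
    print(f"S(t) = {s:<4}  H(t) = {-log(s):.3f}")  # e.g. S(t) = 0.2 gives H(t) = 1.609, above 1
```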


Cox Regression
Cox Regression is a survival model that represents hazard as a function of time and predictor
variables that can be continuous or categorical. Because it allows for multiple predictors, it is
more general than the Kaplan-Meier method. It is considered a nonparametric, or perhaps more
accurately, a semi-parametric model because it does not require a particular functional form to the
hazard or survival curves. This allows it to be applied to data sets exhibiting very different
survival patterns. As we will see shortly, the model does assume that the ratio of the hazard rate
between any two individuals or groups remains constant over time (it is also called Cox’s
proportional hazard model). When this assumption is not met, the Cox model can be extended to
incorporate time-varying predictors, which are interaction terms between the predictors and time.

In the Cox Regression model, the hazard function at time t as a function of predictors X1 to Xp,
can be expressed as shown below.

h(t | X1, X2, ..., Xp) = h0(t) * e^(X1*B1 + X2*B2 + ... + Xp*Bp)

The hazard function is expressed as the product of two components: a base hazard function that
changes over time and is independent of the predictors, h0(t); and factor or covariate effects,
e^(X1*B1 + X2*B2 + ... + Xp*Bp), that are independent of time and adjust the base hazard function. The shape
of the base hazard function is unspecified and is empirically determined by the data (the
nonparametric aspect of the model). Since the model effects relate to the hazard through the
exponential function, the exponentiated value of a model coefficient (say e^B1) can be interpreted
as the multiplicative change in the hazard function associated with a one-unit change in the
predictor (X1), controlling for other effects in the model.

The separation of the model into two components, one a function of time alone and the other a
function of the predictors alone, implies that the ratio of the hazard functions for any two
individuals or groups will be a constant over time (the base hazard function cancels out, leaving
the ratio constant over time and based on the differences in predictor values). If this model
assumption (the effects of the predictors are constant over time) is not met, then the Cox
Regression model will not provide the best fit to the data. As mentioned earlier, an extension to
the Cox Regression model supports time-varying covariates and there are diagnostic plots that aid
in assessing the assumption.
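
A small sketch makes the "baseline cancels out" point concrete: whatever shape h0(t) takes, the ratio of two cases' hazards depends only on the coefficients and the difference in their predictor values. The baseline function, coefficients, and predictor values below are all invented for illustration.

```python
from math import exp

def hazard(t, x, b, h0=lambda t: 0.01 * t ** 0.5):   # any baseline shape works
    """Cox-form hazard: baseline h0(t) times exp of the linear predictor."""
    return h0(t) * exp(sum(bi * xi for bi, xi in zip(b, x)))

b = [0.8, -0.03]                       # two illustrative coefficients
case_a, case_b = [1, 60], [0, 60]      # two cases differing only on the first predictor
for t in (30, 180, 365):
    print(t, round(hazard(t, case_a, b) / hazard(t, case_b, b), 4))   # always exp(0.8) = 2.2255
```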

Since the hazard function is related to the cumulative survival function, this model can
alternatively be expressed in terms of cumulative survival. Survival plots, based on the model,
can be generated.

An Example: Cox Regression


Kleinbaum (1996) presents data provided by John Caplehorn (The University of Sydney,
Department of Public Health) in which survival time in addiction treatment programs was studied
as a function of several predictors. The outcome measure was time (in days) spent in a program
for heroin addicts. The terminating event was departure from (quitting) the program. Data were
censored for participants still in the program when the study was completed. Predictor variables
included clinic (there were two clinics whose programs differed), prison (whether or not the
addict had a prison record), and methadone dose (methdose, measured in mg/day). (Note that
there are slight differences in the data presented here from the data in Kleinbaum (1996); small
differences were found in the summary statistics.)

We will set up a Cox Regression model and view the results. In addition, we will examine the
model assumptions in terms of a predictor that Kleinbaum found troublesome.

Click File…Open…Data
Double-click on Addicts.sav

Figure 5.15 Methadone Study Data

There were two clinics in the study (coded 0 and 1). If the terminating event occurred (departure
from clinic program) the status variable was coded 1. The survtime variable records the number
of days spent in the program. Prison is coded 0 (no prison record) or 1 (yes, prison record) and
methdose records the methadone daily dose (mg/day).

Click Analyze…Survival…Cox Regression


Move survtime into the Time box
Move status into the Status box
Click the Define Event button
Type 1 into the Single Value text box
Click Continue
Move clinic, methdose, and prison into the Covariates list box


Figure 5.16 Cox Regression Dialog

Click the Categorical button


Move clinic and prison into the Categorical Covariates list box

Figure 5.17 Cox Regression: Categorical Covariates Dialog

Since clinic and prison are dichotomies (coded 0,1), they could be left as continuous predictors in
the model. However, in order to obtain cumulative survival or hazard plots for the different
groups (prison – no, yes; clinic – 0, 1), we must specify these variables as categorical covariates.
(We could specify one as a strata variable, but then it wouldn’t be included in the model.)


Figure 5.18 Cox Regression Dialog with Defined Variables

The setup for time and event status is identical to the Kaplan-Meier procedure. As we saw with
Binary Logistic Regression (Chapter 3), the a*b button can be used to add interaction terms to the
model. By default, all predictor variables will be entered into the model. However, as with Binary
Logistic Regression, forward and backward stepwise variable selection methods are available.
We’d recommend using the Forward: LR (likelihood ratio) method if you wish to perform
automatic variable selection. Since there are only three predictors, we will use the default (Enter)
method to include them all.

Also as in the Kaplan-Meier procedure, there is a Strata box. A different base hazard function will
be fit to each category of a strata variable, although all categories will share the same predictor
coefficients. A strata variable can be used to explore the assumption of proportional hazard
functions, which we will see later.

Click the Plots button


Click the Survival and Hazard check boxes
Move clinic into the Separate lines for: box


Figure 5.19 Requesting Survival Plots

We viewed survival and hazard plots earlier in the context of the Kaplan-Meier procedure. The
log minus log [ln(–ln(cumulative survival))] plot is useful in evaluating the proportional hazard
assumption in the Cox Regression model for a strata variable (we will see this later). If a
categorical covariate is specified in the Separate lines for box, separate lines appear in the
requested plots for each categorical value. Since these plots are based on the model, the values for
the other predictors (by default) are set to their means. However, you can change these values if
you wish (for example, you could set the value of prison to 1 in order to have the plots reflect
those with prison records).

The Save dialog (not shown) allows you to save such values as the cumulative survival function
estimates and standard errors, as well as some residuals. Although beyond the scope of this
discussion, it is worth mentioning that partial residual plots can be used to examine the
proportional hazard assumption of the Cox Regression model (see Klein and Moeschberger
(1997) or the SPSS Case Study on Cox Regression; if Case Studies were installed with SPSS,
click Help…Case Studies…Advanced Models Option…Cox Regression).

The Options dialog (not shown) controls some of the stepwise options (for example, selection
criteria) and contains a check box to display confidence intervals for the e^Bi (Exp(B)) estimates.

Click Continue
Click OK

The Case Processing Summary table reports the frequency of the event and censoring, and counts
of cases dropped from the analysis. The analysis will be based on 238 patients. Cox Regression
requires complete data (predictors, time and event variables) for all cases included in the analysis.


Figure 5.20 Case Processing Summary Table

By default, Cox Regression sets up an indicator-coding scheme for categorical predictors and the
last category is the reference category.

Figure 5.21 Categorical Predictor Coding

Thus, for the prison variable, the “yes” category (coded 1) will be the reference category in the
model. Note that this is the reverse of the original coding of prison. If you prefer to specify the
first category as the reference category, which in this case might be less confusing, you can do so
within the Cox Regression: Define Categorical Covariates dialog. We will proceed with the default
coding, but need to be mindful of it when interpreting the estimated model coefficients.

Figure 5.22 Omnibus Tests of Predictors

Since all predictors were entered at once, the values reported in the Change From Previous Step
and Change From Previous Block sections are identical. Here we are testing whether the effects of
one or more of the three predictor variables are significantly different from zero in the population.
This is analogous to the overall F test used in regression analysis. The results indicate that at least
one predictor is significantly related to the hazard. (An omnibus test is also done using the score
statistic, which is used in stepwise predictor selection.)

Figure 5.23 Variables in the Equation Table

The B coefficient estimates give the change in the natural log of the hazard per one-unit change in
the predictor. As such they are difficult to understand (although positive values are associated
with increasing hazard and lower survival time, while negative values are associated with
decreasing hazard and increasing survival times). For this reason, the Exp(B) column is usually
used when interpreting the results.

The significance of each predictor is tested using the Wald statistic and the associated probability
values are reported in the Sig. column. Here, clinic and methadone dose (methdose) are
significant at the .05 level, while prison (.06) is not.

The Exp(B) column presents the estimated change in risk (hazard) associated with a one-unit
change in a predictor, controlling for the other predictors. When the predictor is categorical and
indicator coding is used, Exp(B) represents the change in hazard when changing from the
reference category to another category and is referred to as relative risk. Exp(B) is also called the
hazard ratio, since it represents the ratio of the hazards for two individuals who differ by one unit
in the predictor of interest. The Exp(B) value for clinic is 2.743; this means that other things
equal, the hazard in clinic 0 is 2.743 times greater than the hazard in clinic 1 (recall the indicator
coding). Thus patients in clinic 0 exhibit greater risk and lower survival times. The estimated
Exp(B) for methdose (.965) indicates that a one-unit (one mg/day) increase in dosage multiplies
the hazard by .965 (roughly a 3.5% decrease). This shift may seem small, relative to the clinic
effect, but it is important to be aware that a one-unit change in clinic represents the entire effect
of that variable, while a one-unit change in methdose is about 1/90th of the range of methadone
dosage (20mg/day to 110 mg/day) in this study. Although prison is not significant, its estimated
coefficient (.730), if significant, would indicate that the group with no prison record is at less risk
than the group with a prison record. In this way, individual predictors can be examined for
statistical significance and their coefficients interpreted.
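
One practical way to read the methdose coefficient is to scale it to a larger, more meaningful dose change. Since Exp(B) multiplies the hazard once per mg/day, a 10 mg/day increase multiplies it by .965 raised to the tenth power; the sketch below is just that arithmetic, holding the other predictors fixed.

```python
exp_b = 0.965                      # Exp(B) for methdose from Figure 5.23
print(round(exp_b ** 10, 3))       # about 0.70: a 10 mg/day increase cuts the hazard by roughly 30%
```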

Since prison was found to be non-significant, if everything else checked out, the next step would
be to rerun the model without the prison variable. We will consider this after reviewing the
remaining output and considering the proportional hazards assumption.


Figure 5.24 Survival Plot for Clinic Groups

As requested, the survival functions for the two clinic groups appear as separate lines, both
adjusted to the mean level of prison and methdose. The difference in clinics (clinic 0 having
greater risk) is reflected in the lower survival values for clinic 0; clinic 1 seems more effective in
retaining patients in the program for a longer time period.

Figure 5.25 Hazard Plot for Clinic Groups

Similarly, the cumulative hazard plot shows the greater risk of the event (program departure)
associated with clinic 0.


Checking the Proportional Hazards Assumption


In the previous example, we fit a proportional hazards model but did not examine the proportional
hazards assumption (that the hazard functions of any two individuals or groups remain in constant
proportion over time). There are several approaches to this for predictors, some available within
the Cox Regression and Kaplan-Meier procedures:

• Examine the survival or hazard plots (in Kaplan-Meier) with the categorical predictor as
the factor
• Examine the survival or log-minus-log plot in Cox Regression with the categorical
predictor specified as a strata variable
• Save partial residuals and plot them against time (see Cox Regression case study for an
example)
• Fit a Cox Regression model with a time-varying covariate; examine its significance and
contribution

We will illustrate by specifying clinic as a strata variable within Cox Regression and examining
the survival and log-minus-log plots.

Click the Dialog Recall tool button , click Cox Regression


Click the Plots button
Click the Log minus log check box [not shown]
Click Continue
Remove clinic from the Covariates list box
Move clinic into the Strata box

Figure 5.26 Cox Regression with Clinic as a Strata Variable

A separate base hazard function will be fit to each category of the strata variable. If clinic does
not conform to the proportional hazards assumption, this should be revealed in the survival and
log minus log plots, which will present a line for each clinic category (since clinic is the strata
variable).


Click OK

Since the focus of this analysis is on the proportional hazards assumption, we will not examine
the model results but move directly to the diagnostic plots.

Scroll down to the Survival plot

Figure 5.27 Survival Plot: Clinic Groups Adjusting for Prison and Methadone Dose

The survival plots for the two clinics diverge substantially over time, suggesting that the hazard
ratio for the two groups is not constant over time. Because clinic is not used as a predictor in the
model (it is a strata variable), the expected survival values allow us to assess whether clinic
would meet the proportional hazards assumption in this context.

Scroll down to the LML plot

Another way of examining the proportional hazards assumption is through the ln(-ln) plot. It is
easier to use than the survival plot, since we simply have to judge whether the lines for the
different groups are parallel. However, the price paid for this simplicity of view is a more
complex function. It can be shown mathematically (for example, see Chapter 4 in Kleinbaum,
1996) that if the proportional hazard model holds (in our example it means that over time, the
hazard functions of the clinics differ by a constant proportion), then the natural log of the
negative of the natural log of the survival functions for different groups over time will form
parallel lines. Here the ln(-ln) survival lines diverge, especially after about 1 year (365 days),
indicating that the proportional hazards assumption does not hold for clinic. Although not done here, a
similar analysis could be run for the prison predictor.
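
The sketch below shows the quantity being plotted and why parallel lines indicate proportional hazards: if S0(t) = S1(t)^HR for a constant hazard ratio HR, then ln(−ln S0(t)) and ln(−ln S1(t)) differ by the constant ln(HR) at every time point. The survival values are illustrative, not the addicts-data estimates.

```python
from math import log

def lml(s):                                   # ln(-ln(S(t))), the value on the LML plot's y-axis
    return log(-log(s))

s_clinic1 = [0.95, 0.85, 0.70, 0.55]          # illustrative survival values for one group
s_clinic0 = [s ** 2.0 for s in s_clinic1]     # proportional hazards with hazard ratio = 2
print([round(lml(a) - lml(b), 3) for a, b in zip(s_clinic0, s_clinic1)])   # constant gap, ln(2)
```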


Figure 5.28 Log Minus Log Plot: Clinic Groups Adjusting for Prison and Methadone Dose

In light of this, the data do not seem to meet the assumptions of the Cox Regression model we fit
and it may be possible to improve upon it. One option is to model this data using an extended Cox
Regression model including a time-varying covariate. In this case, it would mean an interaction
term between clinic and some function of time, meaning that the effect of clinic on the hazard
function is not constant over time. In his presentation of this data, Kleinbaum (1996) discusses
several extended Cox models. We briefly introduce one of them next in the appendix.

Appendix: Brief Example of Cox Regression with a Time-Varying Covariate
Cox Regression can be extended to incorporate time-varying covariates. It supports covariates
that intrinsically vary over time, for example body weight or blood pressure (these are called
segmented time-dependent covariates). It also accommodates non-proportional models in which
the effect of a fixed predictor (for example, clinic in our previous example) varies over time,
involving interaction terms that combine some function of time with predictor variables.

We will briefly demonstrate how to include a time-varying covariate by extending the Cox
Regression model fit above. In that analysis, the survival plot (Figure 5.27) and ln(-ln) plot
(Figure 5.28) indicated that the effect of clinic increased over time, noticeably so after one year.
To account for this we will use a simple step function over time, set to 0 when time is one year or
less and set to 1 when time is greater than one year (in the survival analysis literature, this is
called a Heaviside function). This function will be multiplied by clinic, thus creating an
interaction term of time by clinic. In the extended Cox model, the coefficient for clinic will
represent the clinic effect when time is one year or less and the coefficient for the time-varying
covariate will represent the change in clinic effect after one year.

Click Analyze…Survival…Cox w/ Time-Dep


Enter (T_ > 365) * clinic in the Expression for T_COV_ window (typing or using buttons)


Figure 5.29 Compute Time Dependent Covariate Dialog

In this dialog, we create the time-dependent covariate (named T_COV_) to be used in the
extended Cox model. T_ is a special variable that represents time that we use in creating the
expression. Arithmetic operations, mathematical functions, and logical operations can be used. In
our example, (T_ > 365) will be evaluated as false (or 0) when survival time is 365 days or less,
and will be true (or 1) when survival time is greater than 365 days. Thus, depending on the
survival time, 0 or 1 is multiplied by clinic, which creates an interaction term of time and clinic.
This effect will be included in the model. Only a single time-dependent covariate can be entered
using this dialog, but the Cox Regression syntax supports multiple time-dependent covariates.
See Help…Command Syntax Reference…COXREG.
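
In ordinary code the expression amounts to the following; t, clinic, and the 365-day cutoff correspond to T_, the clinic variable, and the constant typed into the dialog.

```python
def t_cov(t, clinic):
    """(T_ > 365) * clinic: zero during the first year, equal to clinic afterwards."""
    return (1 if t > 365 else 0) * clinic

print(t_cov(200, 1), t_cov(500, 1), t_cov(500, 0))   # 0 1 0
```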

Click the Model button


Move survtime into the Time box
Move status into the Status box
Click the Define Event button
Type 1 into the Single Value text box
Click Continue
Move T_COV_, clinic, methdose, and prison into the Covariates list box
Click the Categorical button
Move T_COV_, clinic and prison into the Categorical Covariates list box
Click Continue


Figure 5.30 Completed Time-Dependent Cox Regression Dialog

The time by clinic interaction effect (T_COV_) has been included as a predictor in the model.

Click OK

In this brief example, we will not review all the output, but move directly to the table
summarizing the variables in the equation.

Scroll to the Variables in the Equation pivot table in the Viewer window

Figure 5.31 Cox Regression Results with a Time-Varying Predictor

The time-dependent predictor (T_COV_), representing the interaction of time and clinic, is
significant (.004). In addition, the clinic predictor (which now represents the clinic effect during
the first 365 days) is no longer significant (.06). However, since its significance value is near .05
and it is used in the definition of the time by clinic interaction, it would probably be retained in
the model.

Interpreting the estimated coefficients in the Exp(B) column, we would say that controlling for
the other factors, in the first year (time <= 365) the hazard in clinic 0 is 1.616 times greater than
the hazard in clinic 1. Not surprisingly, this differs from the original estimate of the clinic effect
(2.743 – see Figure 5.23). The time-dependent covariate (T_COV_) indicates that after the first
year (time > 365) the hazard in clinic 0 is 6.123 times greater than the hazard in clinic 1. The
6.123 value is obtained by multiplying the clinic effect (1.616) by the clinic by time interaction
(3.789); thus after the first year, the clinic effect is estimated to increase by a factor of 3.789.

Alternatively, since after 365 days the total effect of the hazard in clinic 0 is both clinic and
T_COV_ , and both are dummy variables, we can simply add their B coefficients together (1.332
+ .480 = 1.812) and then exponentiate the result, which will also be 6.123.
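
The sketch below re-expresses that arithmetic, using the B values quoted above (.480 for clinic and 1.332 for T_COV_) and the same Heaviside step; it treats "clinic 0 versus clinic 1" as the indicated contrast, following the text's interpretation of the coding.

```python
from math import exp

def clinic_hazard_ratio(t, b_clinic=0.480, b_tcov=1.332):
    """Hazard in clinic 0 relative to clinic 1 at time t under the Heaviside interaction."""
    step = 1 if t > 365 else 0                 # (T_ > 365)
    return exp(b_clinic + b_tcov * step)

print(round(clinic_hazard_ratio(100), 3))      # 1.616 during the first year
print(round(clinic_hazard_ratio(400), 3))      # 6.123 after the first year
```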

The estimated effects of methdose and prison (.965 and .694) are very similar to the values
obtained in the standard Cox Regression model (Figure 5.23). However, in the presence of the
time-dependent covariate, the prison effect is now significant at the .05 level (Sig. value is .03).
Thus, controlling for the other effects in the model, those with no prison record (prison code 0)
have .694 of the hazard of those with a prison record (prison code 1).

In this way, Cox Regression can be extended to incorporate time-dependent covariates. Note that
when such models are fit, plots and saved estimates are not available.

Extensions
We used a simple two-step (Heaviside) function of time to create the time-dependent covariate.
Other functions of time are possible and may improve the model. For example Kleinbaum (1996)
fit a more extensive step function (values increased from 1 to 3 to 5 to 7 over time), and others
could be tried.


Chapter 6
Cluster Analysis

Topics
• How does Cluster Analysis work?
• Types of Data in Clustering
• What to Look for When Clustering
• Methods
• Distance and Standardization
• Overall Recommendations
• Example I: Hierarchical Clustering of Product Data
• Example II: K-Means Clustering of Usage Data
• Example III: TwoStep Clustering of Telcom Data

Introduction
Cluster analysis is an exploratory data analysis technique designed to reveal natural groupings
within a collection of data. This technique has gained popularity because it addresses a
fundamental analysis task—categorization or separation. Cluster can be expected to suggest
potentially useful ways of grouping just about anything (e.g. respondents, customers, insects,
patients, products, etc.). Its main advantage is that it can suggest, based on complex input,
groupings that would not otherwise be apparent. While cluster holds great promise for providing
useful information regarding your data, an important caution is necessary at the outset. You
should not expect it to provide a perfect, effortless (on your part), seamless, complete, and
timeless solution. Time and effort are necessary, but not sufficient, ingredients for finding a good
solution from a cluster analysis.

To provide an overview of this procedure we first review the types of data on which clustering is
typically performed, and then introduce the main practical aspects of a cluster analysis. Next we
discuss the two most common cluster methods, hierarchical and non-hierarchical, as well as the
pros and cons for standardizing variables when clustering. Finally, three examples of cluster
analysis are presented (one for each of the SPSS cluster procedures). The objective here is to
introduce clustering and possibly provide a jumping-off place for your analysis. Please consult
one of the many excellent clustering references available for a more extensive discussion of these
topics (e.g. Aldenderfer and Blashfield, 1984, Everitt et al., 2001).


How Does Cluster Analysis Work?


If cluster analysis is an exploratory data analysis technique designed to reveal natural groupings
within a collection of data, the next question is: how? The basic criterion used for this is distance.
In short, data observations close together should fall into the same cluster while observations far
apart should be in different cluster groups. Several methods for calculating distance will be
described. Ideally the observations within a cluster would be relatively homogenous, but different
from those contained in other clusters. Since cluster analysis is an exploratory data method,
expecting a unique and definitive solution is a sure recipe for disappointment. Rather, cluster can
suggest useful ways of grouping the data. In practice, you often consider several solutions before
deciding on the most useful one, although the TwoStep procedure will automatically select the
number of clusters. Different cluster (and standardization) methods provide slightly different
perspectives on your data. In practice, different methods can produce different solutions; the
silver lining in this fact is that reconciling the differences often provides insight into the structure
of your data.

Data Types in Clustering


Since cluster analysis is based on distances derived from the measures taken on the observations,
these measures are typically interval scale, ordinal scale (for example, responses coded 1 to 5 for
a strongly agree to strongly disagree rating scale are usually viewed as a “close enough”
approximation to an interval scale), or binary (for example, checklist items). For binary or
checklist data, some customized distance measures have been developed (available in the Cluster
procedure). In addition, the SPSS TwoStep cluster procedure can work with a mixture of
categorical and continuous variables.

Cluster analysis has been performed on all kinds of data. Here are just a few examples:
classifying (1) plants and animals into ecological groups, (2) companies for product usage, (3)
respondents participating in social science research investigating nonverbal judgments of facial
expressions, and (4) children under observation at an aphasic clinic based on physiological,
psychological or biographic measures. Classification of things into groups is a fundamental
human endeavor. Most often, a cluster analysis is based on only one (or perhaps two) of the data
types mentioned above. This is primarily done in an attempt to reduce complication in the
interpretation; nothing in the cluster procedure itself requires this.

What to Look For When Clustering


Separation
In this section we outline what to focus on while running cluster analysis. When we initially
evaluate a cluster solution we can examine one or more representations of the separation between
the groups. We will focus on whether the groups are arguably distinct and whether some are more
separated than others are. One graphical method to depict separation is a dendrogram. Figure 6.1
displays this type of chart, produced by a hierarchical cluster analysis of 20 beers.


Figure 6.1 Dendrogram from a Cluster Analysis

The original observations, here beers (what you are clustering), are listed along the left vertical
axis. The horizontal axis represents distance between the clusters. The dendrogram illustrates how
observations combine into clusters based on distance. Observations connected into clusters (by
vertical lines) near the left of the diagram are closer together, while those connected near the right
are relatively far apart. We can see that there seem to be four distinct clusters of beer types,
which immediately resolve into two clusters: US domestic and imported beers versus light beers.

Number of Observations in Clusters


The number of observations in each group will also be relevant. Outliers can form useless or
meaningless clusters containing just a few observations. Below we present a frequency table
showing eight cluster groups based on customer reports of software usage.

Figure 6.2 Sizes of Cluster Groups


Of the 310 customers clustered into eight groups, we see that four of the clusters contain a
substantial number of observations while the remaining clusters contain only one or two cases.
Depending on the procedure used, we might rerun the analysis, dropping the data from the small
clusters, or ignore these clusters, or request special handling of outliers (TwoStep). This is not to
deny their validity as distinct groups, but the numbers are far too small to be of interest (for
example, too small to consider basing a marketing or sales plan on them). Also, note the first
cluster contains almost half the data.

Cluster Profiles
For help in understanding and interpreting the clusters a good place to start is by examining the
cluster group means on the continuous cluster variables and percentages on the categorical cluster
variables. These can be viewed in table format or as plots to aid visualization. Below we show a
multiple line chart for beers clustered on cost, amount of alcohol, amount of sodium, and calories.
The variables were standardized and four clusters are displayed.

Figure 6.3 Profiling Cluster Groups

The graph shows that the four clusters are relatively distinct on the clustering variables. One
cluster (2) is high on alcohol, cost (especially), and calories but low on sodium (imported beers?);
another cluster (1) is relatively high on alcohol, calories and sodium but low on cost (US
domestic beers?); a third cluster (4) has relatively low mean values on all four variables (light
beers?); finally, a fourth cluster (3) has midrange values on all four variables (lower-cost
imports?).

Validation
Also, a validation step is important. Ask yourself whether the clusters make conceptual sense. Do
the clusters replicate using a different sample (if available)? Do clusters replicate using a
different clustering method? Finally, ask yourself whether the clusters are useful (theoretically
and/or practically).


Methods
At a general level the purpose of cluster analysis can be simply stated: to suggest naturally
occurring groups in data based on proximity. However, there is considerable complexity in the
details. The main sources of this complexity are variations in how distances between observations
are calculated and in the rules used for cluster formation. This is, at least in part, because cluster
analysis has been applied to so many diverse types of data. Over time, variants of cluster have
been devised to perform optimally with specific data structures. In this section we will not review
every clustering method and distance calculation, but rather will discuss the most popular
methods, their strengths and weaknesses. As mentioned earlier, no single method is best in all
conceivable situations, but some clustering methods do well (recover known solutions in
simulation studies), or at least better than others, under fairly general conditions.

The first broad distinction in clustering methods is between hierarchical and nonhierarchical
clustering techniques. Hierarchical clustering requires observations to remain together once they
have joined in a cluster. Nonhierarchical clustering does not impose this restriction. Each will be
described in turn.

Hierarchical Methods
Within the class of hierarchical cluster methods there are a number of alternatives to how distance
between clusters is evaluated. The type of data you are working with and what makes the most
conceptual sense for the clusters you are trying to form typically govern the choice you make.

The next two figures illustrate four common ways to assess distance. The Nearest Neighbor
(single linkage) method calculates distance between clusters as the smallest distance among all
pairs of points, one from each group. A unique property of this method is that it tends to create
clusters shaped like sausage links. Conversely, you can base your distance calculations on the
Furthest Neighbor (complete linkage) method. This method is relatively insensitive to
outliers and, as the name suggests, the two furthest members within the clusters determine the
distance between two clusters.

Another alternative is the Centroid method, which assesses distance between clusters based on the
distances between their means. This method is also relatively robust to outliers, but doesn’t do
well with noisy data (e.g. data in which points appear between clusters). Between-Groups
Average Linkage assesses the distance for all possible pairings of points between clusters, and
then takes the distance between the clusters as the average of these distances. This is the default
method for assessing distance in SPSS. It performs well with noisy data (data in which points
appear between clusters) and over a variety of conditions. Its weakness, however, is that it is
influenced by outliers.

Finally, Ward’s method (not illustrated) creates clusters that yield the smallest possible within-
cluster variance. Like Between-Groups Average Linkage, it is good with noisy data but sensitive
to outliers. An important additional property of Ward’s method is that it tends to create clusters of
about equal size. Note that the influence of outliers cited above is mitigated if complete coverage
(every observation is assigned to a cluster) is not required. For more detail on cluster method
comparisons, see Punj and Stewart (1983).


Figure 6.4 Nearest and Furthest Neighbor Methods

Figure 6.5 Centroid and Between-Groups Average Linkage Methods
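To make these linkage rules concrete, the short sketch below (written in Python with SciPy rather than SPSS, purely as an illustration; the data are invented) clusters the same small data matrix under each rule and prints the resulting cluster sizes when the tree is cut at four clusters.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Invented example data: 20 observations measured on 3 variables
rng = np.random.default_rng(42)
X = rng.normal(size=(20, 3))

# The linkage rules discussed above, under their SciPy names:
# 'single'   = Nearest Neighbor        'complete' = Furthest Neighbor
# 'centroid' = Centroid method         'average'  = Between-Groups Average Linkage
# 'ward'     = Ward's method
for method in ["single", "complete", "centroid", "average", "ward"]:
    Z = linkage(X, method=method, metric="euclidean")
    groups = fcluster(Z, t=4, criterion="maxclust")   # cut the tree at 4 clusters
    print(method, np.bincount(groups)[1:])            # cluster sizes under each rule

Running such a comparison on your own data is a quick way to see how strongly the choice of linkage rule shapes the solution.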

Non-Hierarchical Method: K-Means Clustering


As mentioned above, nonhierarchical clustering methods do not require that two observations
placed in the same cluster must remain together. Thus they impose a less rigid structure on the
data than do the hierarchical methods.

The most popular is K-means clustering (or the K-means algorithm). The “K” in the name is
derived from the fact that for a specific run the analyst chooses (guesses) the number of clusters
(K) to be fit. The “means” portion of the name refers to the fact that the mean (or the centroid) of
observations in a cluster represents the cluster. Since the analyst must pick a specific number of
clusters to run, typically several runs are made, each with a different number of clusters, and the
results evaluated with the criteria mentioned earlier (separation, group size, pattern of means, and
validation). Since the number of clusters is chosen in advance and is usually small relative to the
number of observations, the K-means method runs much quicker than the hierarchical methods.
This is because if seven clusters are run, then the program only needs to track those seven cluster centers. In
hierarchical clustering the distance between every pair of observations must be evaluated and
inter-cluster distances recomputed at each cluster step (which is a fairly intensive computational
task). Thus when clustering many observations (many hundreds or thousands), K-means is usually
the method of choice (although you can sample from the large file and apply a hierarchical
method, or run TwoStep clustering, which also runs efficiently on large files).


One nice feature of K-means clustering (in SPSS) is that you can try out your own ideas (or apply
the results of other studies) as to the definition of clusters. An option is available in which you
provide starting values (group means) for each of the clusters, and the K-means procedure will
base the analysis on them.

A brief description of the K-means method follows. If no starting values (see preceding
paragraph) for the K cluster means are provided, the data file is searched for K well-spaced
observations (using distances based on the set of cluster variables), and these are used as the original
cluster centers. The data file is then reread with each observation assigned to the nearest cluster.
At completion every point is assigned to some cluster, and the cluster means (centroids) are
updated due to the additional observations (optionally the updating can be done as each
observation is assigned to a cluster). At least one additional data pass (you control the number of
iterations) is made to check that each observation is still closest to the centroid of its own cluster
(recall the cluster centers can move when they are updated because of the addition or deletion of
members), and if not, the observation is assigned to the now nearest cluster. Additional data
passes can be made until there are no occurrences of observations switching clusters.
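The assign-and-update cycle just described can be sketched in a few lines of NumPy. This is a bare-bones illustration of the general K-means idea, not the SPSS implementation; the invented data, the value of K, and the crude choice of starting centers are all assumptions made for the example.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))      # invented data: 200 cases, 4 cluster variables
K = 3

centers = X[:K].copy()             # crude starting centers (SPSS instead seeks K well-spaced cases)
assignment = np.zeros(len(X), dtype=int)

for iteration in range(20):        # data passes (iterations)
    # assignment step: each case goes to the nearest current center
    dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    new_assignment = dists.argmin(axis=1)
    if iteration > 0 and np.array_equal(new_assignment, assignment):
        break                      # no case switched clusters, so stop
    assignment = new_assignment
    # update step: recompute each cluster mean (centroid)
    for k in range(K):
        members = X[assignment == k]
        if len(members) > 0:
            centers[k] = members.mean(axis=0)

print(np.bincount(assignment))     # cluster sizes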

Based on simulation studies, K-means clustering is effective when the starting cluster centers are
well spaced. Also, for large files it runs far faster than the hierarchical methods and requires
considerably less memory. However, several trials are typically run before settling on a
solution. Finally, since the method is not hierarchical, the dendrogram, useful in evaluating
cluster solutions, cannot be created.

Non-Hierarchical Method: TwoStep Clustering


Compared to other clustering methods, TwoStep clustering has several potential advantages. In
terms of scalability, TwoStep clustering can be performed in a single data pass and does not
require a large amount of memory storage. TwoStep can automatically select the number of
clusters based on statistical criteria, unlike the other methods reviewed. Finally, TwoStep
incorporates both categorical and continuous variables in its clustering algorithm, taking account
of their different properties. As such, it presents an attractive clustering method, especially when
applied to large datasets (for example, in data mining applications).

The two steps in the TwoStep algorithm refer to a pre-clustering step and a clustering step.
During the pre-clustering step, individual records are grouped into pre-clusters (maximum
number of pre-clusters is under user control) in a single data pass. Next, in the clustering step, an
agglomerative hierarchical cluster analysis algorithm is applied to the pre-clusters. Statistical
criteria are recorded as clustering is performed and are used to determine the optimal number of
clusters (within a user-specified range). These steps are described in greater detail below.
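The CF-tree pre-clustering used by SPSS is not reproduced here, but the general two-stage idea can be illustrated with a rough, non-SPSS sketch: compress the records into many small pre-clusters in one inexpensive pass, then apply agglomerative hierarchical clustering to the pre-cluster centers. The scikit-learn stand-ins below handle continuous variables only and do not choose the number of clusters automatically.

import numpy as np
from sklearn.cluster import MiniBatchKMeans, AgglomerativeClustering

rng = np.random.default_rng(1)
X = rng.normal(size=(10000, 5))            # invented large dataset, continuous variables only

# Stage 1 (stand-in for pre-clustering): compress the data into many small pre-clusters
pre = MiniBatchKMeans(n_clusters=300, n_init=3, random_state=1).fit(X)
pre_centers = pre.cluster_centers_

# Stage 2: agglomerative (hierarchical) clustering of the pre-cluster centers
agg = AgglomerativeClustering(n_clusters=4).fit(pre_centers)

# Map each original record to its final cluster through its pre-cluster
final_labels = agg.labels_[pre.labels_]
print(np.bincount(final_labels))           # final cluster sizes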

Assumptions
The TwoStep algorithm assumes that the continuous cluster variables are uncorrelated and follow
normal distributions. To the extent that these variables are correlated, the source of the shared
correlation will receive greater weight in the analysis (for example, if identical copies of a
variable were included, the weight of this variable would double). Categorical variables are
assumed to follow multinomial distributions and to be independent of each other, as well as
independent of the continuous predictors. These assumptions are made because, as we will see,
distance in TwoStep is measured using the log-likelihood function by default, and so
distributional assumptions are made. Internal testing at SPSS indicates the algorithm is somewhat
robust to violations of these assumptions, but it should be noted that the algorithm is tied to
distributional assumptions, unlike the K-means or most hierarchical clustering methods.

Pre-Clustering Step
In the pre-clustering step, individual records are placed into pre-clusters during a single data pass.
The maximum number of pre-clusters is determined by Advanced Option settings in the TwoStep
procedure (by default, the maximum is over 500). Records are placed in pre-clusters using a CF
tree (Cluster Features tree) structure (see Zhang et al., 1996 or Chiu et al., 2001), which is similar
to the B+ tree structure used for indexing in database systems. As a new record is processed, it moves
through the CF tree and if its distance to the nearest pre-cluster (leaf) is within the criterion value
(Advanced Options provide control over the criterion), it is added to the pre-cluster; otherwise, it
forms a new leaf in the CF tree (a new pre-cluster).

If the CF tree fills and the noise handling option is not selected, the distance criterion is increased
and the now smaller tree reorganized. If the noise handling option is selected, then outliers
(defined as leaves, or pre-clusters, containing few records relative to the largest leaf) are grouped
together in a single leaf, freeing space in the CF tree.

Distance is calculated, by default, using the change in log-likelihood function. Thus a record is
placed into the pre-cluster to which it has the greatest likelihood of belonging (under the
assumptions mentioned earlier). A pre-cluster is described by only a few summary statistics:
sample size; for each continuous cluster variable the sum and sum of the squared values; for each
categorical cluster variable, the count in each category), so relatively few computational resources
are needed to perform the pre-clustering.

At the end of a single data pass, the pre-clusters are ready to be input to the second step
(clustering) of the TwoStep process.

Clustering Step
In the clustering step, an agglomerative hierarchical cluster analysis is performed on the pre-
clusters using the log-likelihood function (by default) change as the distance measure. Since the
number of pre-clusters is usually small relative to the number of original records, the hierarchical
clustering can run relatively quickly.

At each stage of the hierarchical clustering, a goodness-of-fit statistic (AIC (Akaike Information
Criterion) or BIC (Schwarz Bayesian Criterion)), indicating how well the data are fit by the
current number of clusters, is recorded. Smaller values are better for the AIC and BIC measures,
which are functions of –2*log-likelihood and contain a penalty for complexity of the model—
here complexity corresponds to the number of clusters. Also tracked at each clustering stage is the
change in goodness-of-fit (AIC or BIC) from the previous cluster step and the ratio between the
goodness-of-fit change in this step and the change in the step from two clusters to one cluster
(baseline for relative change). Thus the change in AIC or BIC at each cluster step is monitored, as
is the change relative to the baseline step (2 clusters to 1 cluster). Often as the number of clusters
increases, the criterion first decreases (more clusters better fit the data) and later increases (when
more clusters than necessary are added, the penalty component is noticeable). The algorithm
examines the relative size of successive changes in the criterion measure, notes the number of
clusters where this change becomes small (meaning a relatively small improvement occurs as an
additional cluster is added), then makes a more refined search (described below) of clusters in the
vicinity.

Also calculated at each cluster step is the ratio of the distance measure (by default, the distance
measure is the change in log-likelihood when two clusters are combined) in the step to the
distance measure in the previous clustering step. The changes in these ratios are used to refine the
cluster solution. A good cluster solution will show a relatively large jump since the distance
measure (by default, the change in log-likelihood) will increase when the correct number of
clusters is fit. The relative change in AIC or BIC (described above) may also show a jump since
the fit measures will improve (AIC or BIC decrease) when truly separate clusters are fit. In the
algorithm, the first ratio (involving relative change in AIC or BIC) is used to obtain a coarse
estimate of the number of clusters and the second (involving change in distance measure) refines
the selection.
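As a rough illustration of how an information criterion can guide the choice of the number of clusters, the sketch below uses scikit-learn's Gaussian mixture BIC as a stand-in for the TwoStep criterion (the data are invented): a range of cluster counts is fit and the change in BIC from one count to the next is reported.

import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(7)
# invented data: three well-separated groups in two dimensions
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(100, 2))
               for c in ([0, 0], [5, 0], [0, 5])])

bics = {}
for k in range(1, 8):
    gm = GaussianMixture(n_components=k, random_state=7).fit(X)
    bics[k] = gm.bic(X)                     # smaller BIC is better

for k in sorted(bics):
    change = bics[k] - bics[k - 1] if k > 1 else float("nan")
    print(f"k={k}  BIC={bics[k]:.1f}  change from previous: {change:.1f}")
# The improvement (negative change) typically shrinks sharply once the
# number of well-separated groups has been reached.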

Distance and Standardization


In addition to the several cluster methods, there are various techniques for measuring distance
between observations. In Figures 6.4 and 6.5 the distance between two observations is
represented as “straight-line” or Euclidean distance. If there are three variables involved in the
clustering (x, y, z), then the Euclidean distance between two observations (x1, y1, z1) and (x2, y2, z2) is
sqrt[(x1 - x2)² + (y1 - y2)² + (z1 - z2)²]. Distance calculated this way is intuitive, but analysts more
often use squared Euclidean distance, which exaggerates the effect of larger distances, but is more
in tune with the sums of squares type measures used by many statistical procedures.
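For example, the two distance variants for a single pair of observations measured on three variables can be computed directly (a trivial NumPy illustration with invented values):

import numpy as np

a = np.array([1.0, 4.0, 2.0])    # observation 1 on variables x, y, z
b = np.array([3.0, 1.0, 6.0])    # observation 2

squared_euclidean = ((a - b) ** 2).sum()    # 4 + 9 + 16 = 29
euclidean = np.sqrt(squared_euclidean)      # about 5.39

print(euclidean, squared_euclidean)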

Other distance metrics can be applied to clustering. For example, the TwoStep procedure supports
using the log-likelihood change as a distance measure, assuming a normal distribution for
continuous variables and multinomial distribution for categorical variables.

SPSS also offers several possibilities if the variables you cluster on represent counts (for
example, frequency of purchase or usage of different products) and a number of distance
measures if the data are binary (for example, if respondents are presented with an adjective
checklist about an activity, or a checklist of activities they participate in). One interesting binary
distance measure in the Cluster procedure (Lance and Williams) would count two respondents as
similar if they both check an item, but would not count them as similar if neither checked the
item. Generally speaking, however, the clustering method has more influence on the solution than
does the choice of distance calculation.

A final issue has to do with whether the variables used to cluster should be standardized in some
way. There are several relevant issues regarding standardization. First, if your cluster variables
represent entirely different scales, then standardization is usually done. For example, suppose we
clustered respondents on two demographic variables, income in dollars and age. Since income in
dollars would have a much greater standard deviation than age, the income variable would
completely dominate the solution (it would be as if we stretched the income axis, then calculated
distances). This is demonstrated in the figure below.


Figure 6.6 Distances with Different Scales

The vertical (income) and horizontal (age) axes are set to the same scale units (single units). It is
clear to the eye that by measuring income in dollars (and not thousands or tens of thousands), any
distances between observations will be almost completely determined by income.

In order to avoid this, when different scales are used in clustering, standardization (using z
scores—variables normed to have a mean of zero and standard deviation of 1) is often done.
Converting to z scores forces each variable to have the same standard deviation and so levels the
playing field across variables. However, we should point out that standardization could reduce the
influence of important clustering variables (by shrinking their scale) and for this reason it is not
performed automatically. But, as you no doubt anticipate, other standardization methods are
available. For example, variables can be normed to range from –1 to +1, or from 0 to 1, or so that
the maximum is 1, etc. Some researchers report that standardizing to a range (for example, from 0
to 1) produces superior results to standardizing with z scores. Note that standardization is not
required if the log-likelihood is selected as the distance measure in TwoStep.
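The standardization choices mentioned here are simple to compute yourself. The sketch below (with invented values) applies z scores and a 0-to-1 range rescaling to two variables on very different scales.

import numpy as np

income = np.array([22000., 48000., 95000., 31000., 150000.])   # invented values, in dollars
age = np.array([25., 41., 58., 33., 62.])

def z_score(x):
    return (x - x.mean()) / x.std(ddof=1)        # mean 0, standard deviation 1

def range_01(x):
    return (x - x.min()) / (x.max() - x.min())   # rescaled to run from 0 to 1

for name, x in [("income", income), ("age", age)]:
    print(name, "z:", np.round(z_score(x), 2), " 0-1:", np.round(range_01(x), 2))
# After either transformation the two variables are on comparable scales,
# so income no longer dominates the distance calculations.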

Overall Recommendations
For smaller data files (fewer than a few hundred cases, especially fewer than 100) several hierarchical
methods (complete linkage, between-groups average linkage, Ward’s method) do fairly well and
the results can be displayed using the dendrogram. In this situation, the TwoStep procedure offers
fewer distance measures but has the advantage of automatically selecting the number of clusters.

For large files (many hundreds or thousands of cases) K-means and TwoStep clustering are efficient methods
and from a resource perspective, may be the only possibilities (other than clustering a sample).

Although both between-groups average linkage and Ward’s method are influenced by outliers,
this problem can be at least partially addressed by removing those cases forming single (or very
small) clusters from the data file and rerunning the cluster analysis.

Squared Euclidean distance is most common in clustering studies, although TwoStep uses log-
likelihood distance in order to accommodate mixtures of continuous and categorical variables. If
clustering on variables with very different scales, then standardization of some sort (typically z
scores) is performed on the variables. Generally, the clustering method influences the solution
more than the distance measure chosen. If the scale (standard deviation) varies substantially
across the variables, then standardization can strongly influence the solution (not so when log-
likelihood distance is used in TwoStep).

Example I: Hierarchical Clustering of Product Data


This example is based on data evaluating 20 brands of beer (the data appeared in Consumer
Reports). The measures are both objective (e.g., percent alcohol, calories, sodium in mg, cost per
12 fluid ounces) and subjective (rated quality of beer). We are interested in determining if the
objective measures can be used to separate these beers into natural clusters. If so, it would be
interesting to know if the clusters are supported by the subjective measure. For example, there
may be a group of beers high in calories, percent alcohol and cost. Are these beers also rated as
higher quality? Do they concentrate in a certain country?

A Look at the Data


Before proceeding with the actual cluster analysis we’ll do a preliminary evaluation of the data.
We need to get our bearings as well as a sense of the central tendency and variability across the
entire sample for our objective measures. Another reason for starting our cluster analysis here is
to check and see if standardization is necessary. It’s likely that we’ll have to standardize since
these four measures are all on different scales. This process will consist of running Descriptives.

In SPSS for Windows:

Click File…Open…Data
Move to the c:\Train\Advstat folder
Double-click on Beer.sav
Click Analyze…Descriptive Statistics…Descriptives
Move alcohol, calories, cost and sodium into the Variables list box

Figure 6.7 Descriptives Dialog Box

We will view the default statistics (Mean, Standard Deviation, Minimum, and Maximum). We
can ask for additional statistics (using the Options subdialog box) if we wish more detail. Notice
the option to generate z scores. If we click on the check box next to Save standardized values as
variables we will create one new z score variable for each of the variables in the Variables list
box. We check this option because we will need these variables as z scores later. However, there
is also an option for standardizing within the hierarchical cluster analysis procedure.


Click the check box next to Save standardized values as variables


Click OK

Figure 6.8 Descriptives on Objective Measures of Beers

The Descriptive Statistics pivot table in the Output Viewer shows large differences across these
four variables for each of the statistics. Means vary from .5 to 132, and standard deviations range
from .14 to 30. The largest (30) is two hundred times greater than the smallest (.14). Since the
standard deviations are so different, we need to standardize our objective variables before
clustering so that the calories variable doesn’t unduly influence the distance calculations.

Setting Up the Hierarchical Cluster Analysis


We can now request the cluster analysis.

Click Analyze…Classify…Hierarchical Cluster


Move alcohol, calories, cost, sodium into the Variable(s) list box
Move beer into the Label Cases by list box

Figure 6.9 Hierarchical Cluster Analysis Dialog

Minimally, the variables used in the clustering must be specified. The Label Cases by box allows
you to use a string variable to identify observations in the plots and summaries that clustering
produces. If no label variable is given, then the case sequence number will be used. There is a
choice to cluster cases or variables. Since beers are the observations in our analysis, we cluster on
cases. By default both statistical summaries and plots will appear.

While we could run cluster at this point using default settings, we will explore various options.

Click the Statistics button

Figure 6.10 Hierarchical Cluster Analysis: Statistics Dialog

By default, SPSS prints the agglomeration schedule that describes what occurs at each step of the
clustering. Deselecting the Agglomeration schedule check box or deselecting the Statistics option
in the Display area of the Hierarchical Cluster Analysis dialog can suppress this. A Distance
matrix would show the calculated distance values between every pair of observations in the file.
As such it can be huge, and is usually printed only for pedagogical purposes. The Cluster
Membership choices allow you to display, for each observation, the cluster it belonged to during
specified stages of the cluster analysis. We’ll leave this unchanged and save cluster membership
information to the data file.

Click Cancel
Click the Plots button
Click the Dendrogram check box
Click the None option button in the Icicle area

The dendrogram is a valuable plot in cluster analysis since it provides information about which
elements cluster and when. In addition, the dendrogram provides information on distance between
clusters, and thus helps you judge the appropriate number of clusters. Dendrograms provide a
“big picture” view along with some of the detail. Icicle plots provide a more magnified look at
what changes occur during each step in the cluster analysis, but do not contain any distance
information. As such, they are less valuable than the dendrograms. For this reason we will not
request an icicle plot.


Figure 6.11 Hierarchical Cluster Analysis: Plots Dialog

Click Continue
Click the Method button
Select Z scores from the Standardize drop-down list

Figure 6.12 Hierarchical Cluster Analysis: Method Dialog

Between-Groups linkage is the default hierarchical method. To select a different method, simply
click on the drop-down arrow and make your choice. The Measure area controls the distance
measure and metric to be used in the analysis. Squared Euclidean distance is used by default.
Notice the measure choices are grouped by scale properties: interval scale, counts, and binary
choices. The Transform area permits you to standardize within each variable (across cases) or
within each case (across variables). Here we have the option within hierarchical cluster analysis
to standardize. Recall that you might standardize each variable if the cluster variables are not
measured on the same scale. Or, if you wanted to, you could standardize within each case. This
would be a useful thing to do if you were interested in clustering on similar patterns of responses
(for example, those with all high values and those with all low values show the same pattern).
The default is to invoke no standardization and the Standardize drop-down arrow permits you to
pick from a list of standardizing methods. Given our preliminary analysis, we standardize our
variables using z scores.

Click Continue
Click the Save button
Click the Range of solutions button
Type 2 in the Minimum number of clusters text box
Type 5 in the Maximum number of clusters text box

Figure 6.13 Save Subdialog Box

The Save subdialog box allows us to create new variable(s) containing values for cluster
membership. This is the preferred alternative to getting a list of group membership under the
Statistics button. You might not save cluster membership variables during your first run, but
would instead examine the dendrogram to help decide on which cluster solutions to focus. We’ll
go ahead and save new variables representing a range of cluster solutions with our first pass, and
use summaries and plots of these variables to compare the solutions.

Click Continue, and then click OK
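For readers working outside SPSS, the setup just specified (z-score standardization, between-groups average linkage on squared Euclidean distance, saving the two- through five-cluster solutions, and drawing a dendrogram) looks roughly like the sketch below. The file name and column names are assumptions, and SciPy's 'average' method on squared Euclidean distances stands in for Baverage.

import pandas as pd
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram

beer = pd.read_csv("beer.csv")                      # assumed export of Beer.sav
vars_ = ["alcohol", "calories", "cost", "sodium"]   # assumed column names

z = (beer[vars_] - beer[vars_].mean()) / beer[vars_].std()   # z-score standardization

# Between-groups average linkage on squared Euclidean distance
Z = linkage(z.values, method="average", metric="sqeuclidean")

# Save a range of solutions (2 to 5 clusters), as in the Save subdialog
for k in range(2, 6):
    beer[f"clu{k}_1"] = fcluster(Z, t=k, criterion="maxclust")

dendrogram(Z, labels=beer["beer"].tolist(), orientation="right")
plt.show()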

Cluster Results
The figure below shows the Output Viewer results from the hierarchical cluster analysis. Note
that two sections have been added: Proximities and Cluster. SPSS uses Proximities to calculate
distances. Also note that the Contents Pane displays the agglomeration schedule (if you don’t see
it either select it from the Outline Pane or scroll to it in the Contents Pane).


Figure 6.14 Agglomeration Schedule

The agglomeration schedule details what occurs at each stage of the analysis, from when the first
two observations form a cluster (Stage 1) to when the last two remaining clusters merge to form
one cluster (Stage 19). The most useful information in the schedule is the distances (see the
Coefficients column); they also appear in the dendrogram, so it is rare that an analyst spends
much time with the schedule itself for small sample sizes. Nonetheless, we will quickly review
how to read the schedule. With 20 observations, the schedule contains 19 stages.

The Stage column notes how far the clustering has progressed; at Stage 1 the two closest
observations form a cluster. The next two columns indicate which observations or clusters are
joined at each stage. In Stage 1, the 11th and 17th beers in the data (Coors and Hamms) form a
cluster. We know that two observations are joined here because it is the first stage, but we can
verify this by glancing at the two columns labeled Stage Cluster 1st Appears. If a value is 0, then
an observation is joined in the new cluster; if a value is other than 0, then a cluster originally
formed at an earlier stage (the stage specified by the value) joins the new cluster. The convention
is to denote the cluster by the case sequence number of its first member. For example looking at
Stage 5 we see the cluster formed at Stage 1, denoted by case number 11, is combined with case
number 2 (Schlitz). The last column, Next Stage, tells us when the cluster formed in this stage is
merged with something else (the cluster formed at stage 1 is merged with an observation in stage
5). Using these columns you can trace the entire history of the clustering. The Coefficient column
contains the distance measure between the observations or clusters being joined. Notice that the
distance values at earlier stages are much smaller than values at later stages.


A common heuristic for determining the number of clusters is to look for a large gap in the
distance values at the bottom of the schedule. Here the largest gap is between the 18th and 19th
Stage, when the cluster analysis goes from having two clusters to having all the beers in one
cluster. This is a disappointing number of clusters. Notice though that there are also good-sized
gaps between Stages 16 and 17 and Stages 17 and 18, relative to the rest of the stages. Therefore
the schedule suggests between two and four clusters and we will have to rely on other sources of
information to refine the number of clusters. (Remember our admonition about cluster analysis
requiring effort on your part.)
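This look-for-a-large-gap heuristic is easy to automate once the merge distances are available. The sketch below takes them from a SciPy linkage matrix computed on invented data; with SPSS output you would paste in the Coefficients column instead.

import numpy as np
from scipy.cluster.hierarchy import linkage

rng = np.random.default_rng(3)
X = rng.normal(size=(20, 4))                 # invented stand-in for the 20 beers

Z = linkage(X, method="average", metric="sqeuclidean")
coefficients = Z[:, 2]                       # merge distances, like the Coefficients column

gaps = np.diff(coefficients)                 # jump in distance from one stage to the next
largest = np.argsort(gaps)[::-1][:3]         # the three largest jumps
for s in largest:
    print(f"gap after stage {s + 1}: {gaps[s]:.3f}  -> suggests {len(X) - (s + 1)} clusters")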

Figure 6.15 Dendrogram of Cluster Analysis using Baverage

Observations are listed down the left column and distance is plotted across the top. Clusters are
formed when joined by vertical lines. Can you see a three-cluster solution (or perhaps four)?
Once we go beyond four clusters there are many additional splits at relatively small distances. At
first glance it seems the larger groups might be US domestic beers, imported beers, and light
beers. Can you suggest an interpretation for a four-cluster solution?

Compare Figure 6.15 to Figure 6.1, which displays a dendrogram for the same data using Ward’s
method instead of Baverage. Using Ward’s method a very striking two-cluster solution appears in
which all light beers are grouped together. Ward’s seems to give us a clearer picture of our
clusters, although it might be interesting to see just how the cluster of two beers (Olympia Gold
Light, Pabst Extra Light) differ from the other light beers, and how Kronenbourg and Ausberger
differ from the others. We will continue to evaluate the output from Baverage.


Supplementary Analysis
Since we do not have a definitive answer for the number of clusters from the cluster output itself,
we run additional procedures on the newly created cluster variables, which will hopefully narrow
down the possibilities.

Click the Goto Data tool


Scroll right so the variable clu5_1 appears (not shown)

As a result of our Save request, the Cluster procedure has created four new variables (clu5_1 to
clu2_1) which contain the cluster memberships for each solution between five and two clusters
(not shown). We will run frequency analyses on these to find out how many beers make up each
cluster and examine some mean profiles. These analyses may shed some light on how many
clusters to choose.

Click Analyze…Descriptive Statistics…Frequencies


Move clu2_1, clu3_1, clu4_1 and clu5_1 into the Variable(s) list box

Figure 6.16 Frequencies Dialog Box

Frequency tables will tell us how many beers fall in each cluster for the various cluster solutions.
If we had a larger sample we would look for clusters with very few observations, since these
might represent outliers that would not be of interest. For the larger clusters, we look at their
relative sizes. Are they about equal? Does one cluster contain the vast majority of cases? If so,
you would examine a solution with more clusters to see if the large cluster splits into useful
subgroups.

Click OK


Figure 6.17 Frequency Output

Starting with the two-cluster solution, notice there is one large cluster (n=18) and one small cluster
(n=2). This larger cluster is further divided in the three-cluster solution. In the four-cluster
solution the first large cluster is divided again. Finally with five clusters, one beer splits off by
itself. Which solution is best? We are interested in having more than two clusters, especially
given the small size of the second cluster. Referring back to the dendrogram, in the three-cluster
solution the imported beers have formed their own cluster (clu3_1=2), but there isn’t any clear
conceptual separation for the other beers (perhaps general domestic). This leaves the four-cluster
solution since nothing of substance happened in the five-cluster solution. In the four-cluster
solution it looks like we have US domestic, imported and light beers. However, the light beers
form two clusters for some reason. Next we will use a line chart to examine the mean values of
the cluster variables (cost, sodium, alcohol, and calories) for the four-cluster groups.

Mean Profiles of Clusters


Now we attempt to profile the four clusters. The best starting point is to create a table or chart
displaying for each cluster group the mean scores on the clustering variables. These profiles can
then be examined and should aid in describing and assigning names to the clusters. We will use a
case summaries report and a multiple line chart to display the group profiles.

Click Analyze…Reports…Case Summaries


Move zcost, zcalorie, zsodium, zalcohol into the Variables list box
Move clu4_1 into the Grouping Variable list box
Uncheck the Display Cases check box


Figure 6.18 Summarize Cases Dialog

Click the Statistics button


Remove Number of Cases from the Cell Statistics list box
Move Mean into the Cell Statistics list box (not shown)
Click Continue, and then click OK

Figure 6.19 Cluster Profiles (Means)

The case summaries table presents means for the four standardized cluster variables. In the table,
unstandardized variables could have been used, but since we will create a plot from this table, we
used the standardized form. We could interpret the clusters from the mean patterns (for example
cluster 4 has low values on all the variables) in this table, but we will work from the plot.

Double-click on the Case Summaries pivot table (open the Pivot Table editor)
Click and drag to select the means for clusters 1 through 4 (do not select the Total
means unless you wish to display them as a reference)
Right-click on any of the selected cells
Click Create Graph…Line from the context menu


Figure 6.20 Profiles of Four Clusters

If we wanted to look at fewer cluster groups we could select the means for just some of the
clusters. Group 1 and Group 3 are almost mirror images of each other, with Group 1 above the
mean, and Group 3 below, on such measures as alcohol content, calories and sodium. If we look
at the beer names belonging to these groups (in the dendrogram or Data Editor), it is not
surprising that Group 1 contains domestic regular beers and Group 3 has domestic light beers.
Group 2 is much more costly than the others and its calories and alcohol content match the
domestic regulars. Group 2 represents imported beers. Group 4 is very low on alcohol and
calories, well below the level of the domestic light group. These beers might be described as
“super lights” and the group is composed of just two beers. One consideration at this point would
be whether to carry three or four clusters in the remainder of the analysis. Here we have two
clusters of light beers, one of which is quite small. In making this decision, such aspects as size of
the clusters and meaning of the clusters in the context of your study would play a part in the
decision. For example, in a business application for cost reasons you may be limited in the
number of clusters you can target for advertising campaigns.
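Outside SPSS, the same profile table and multiple line chart can be sketched with pandas; the file and column names below are assumptions carried over from the earlier steps.

import pandas as pd
import matplotlib.pyplot as plt

beer = pd.read_csv("beer_with_clusters.csv")             # assumed file containing clu4_1
zvars = ["zcost", "zcalorie", "zsodium", "zalcohol"]     # assumed z-score column names

profiles = beer.groupby("clu4_1")[zvars].mean()          # mean of each z variable per cluster
print(profiles.round(2))

profiles.T.plot(kind="line", marker="o")                 # one line per cluster
plt.axhline(0, linewidth=0.5)                            # 0 = overall mean on the z-score scale
plt.ylabel("mean z score")
plt.show()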

Relating Clusters to Other Variables


At this point we can tentatively identify our clusters. We might try to validate the solution by
rerunning the cluster analysis with a different method (for example, TwoStep), or if we had
additional data, we might apply the same cluster method to a different sample of data. In this
example we will proceed to address our secondary question. Do subjective ratings of beers follow
our clustering? You might expect that more expensive and richer beers (imports) would be rated
better than lower-cost, lighter beers. We’ll run a crosstabulation to find out.

Click Analyze…Descriptive Statistics…Crosstabs


Move rating into the Row(s) list box
Move clu4_1 into the Column(s) list box
Click the Cells button
Click the Column check box in the Percentages area


Click Continue

Figure 6.21 Crosstabs Dialog Box

We are requesting a two-way table of rating by the four clusters. Column percentages were
requested to see what percentage within a cluster group was given each rating. Note that this
crosstab will have a number of empty cells. This means we would be on shaky ground if we were
interested in statistically testing the relationship between our cluster variable and rating (the Chi-
square test can be unreliable when more than 20% of the cells have expected counts below five).
However, if this were an important issue an alternative would be to test this relationship with the
SPSS Exact Tests option.

Click OK

Figure 6.22 Rating and Cluster Groups

Looking over the percentages we see that the import beers (Group 2) are all rated as good or very
good, the domestic regular beers (Group 1) are evenly distributed across the rating scale, light
beers (Group 3) are mostly good, and, interestingly, the two beers in cluster 4 (the super lights)
are both rated as fair (the worst rating).
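A rough pandas equivalent of this crosstabulation with column percentages (file and column names are assumptions) is sketched below.

import pandas as pd

beer = pd.read_csv("beer_with_clusters.csv")    # assumed file containing rating and clu4_1

counts = pd.crosstab(beer["rating"], beer["clu4_1"])
col_pct = pd.crosstab(beer["rating"], beer["clu4_1"], normalize="columns") * 100

print(counts)              # cell counts
print(col_pct.round(1))    # percentage of each cluster falling in each rating category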


Summary of the First Cluster Example


At this point we really only have preliminary results. It appears that the beers naturally separate
on the objective measures and that the separation is consistent with the subjective ratings. Since
we have a variable identifying the cluster group, we can look for additional relations between the
cluster group and other variables in the study. Also, as mentioned above, we might try to validate
the solution by running another cluster method. If you are disappointed that there was no final
test, please recall our warning that clustering is an exploratory procedure. As with other
exploratory data analysis procedures, at times clustering is as much an art (especially when
naming clusters) as a science.

Example II: K-Means Clustering of Usage Data


In this example we will use data from a survey concerning SPSS product usage (note that the data
have been modified so as not to compromise confidentiality). Customers were presented with a
list of SPSS modules, and they checked which ones they used. Our question is whether the
customer base can be usefully divided into different segments based on product usage. If so,
different marketing strategies can be developed to appeal to different segments.

Each module was defined as a separate variable coded 1 if checked and 0 otherwise. Modules
include Basics, Professional Statistics (now named Regression Models), Advanced Statistics
(now named Advanced Models), Time Series, Presentation Tables, Perceptual Maps (now named
Categories), Automatic Interaction Detection (CHAID), Mapping, and Neural Nets. While many
other questions were asked as part of the study, our data file includes only a single additional
variable (job area). The sample is composed of 310 observations containing complete data. The
SPSS data file is called Usage.sav. As in the previous example, we’ll do a preliminary analysis to
get our bearings.

Click File…Open…Data
Double-click Usage.sav

Click Analyze…Descriptive Statistics…Descriptives


Move all variables (except jobarea) into the Variable(s) list box

Figure 6.23 Descriptives Dialog Box


Descriptives produces mean summaries for each of the 0,1 usage variables. The means represent
the proportion of customers who say they use the specific modules. Since all these variables are
on the same scale, it is unlikely that we’ll need to standardize our variables before clustering.

Click OK

Figure 6.24 Software Module Usage

We see that under half the customers use neural nets while a high proportion use the Base and
Professional Statistics. The standard deviations look comparable and all 310 cases have complete
data on the usage variables (see Valid N (listwise)). Now we will explore whether there are
naturally occurring groups based on product usage.

Setting Up K-Means Clustering


Recall from our discussion above that K-Means clustering is useful when you want to cluster a
large number of observations. Also remember that the K-Means algorithm requires that you
specify the number of clusters to be generated. So, in this situation it helps to know something
about the data (i.e., the customer base) before you start. Those familiar with the data or the
research area may have an idea of what to expect. Another option would be to run hierarchical
cluster analysis first (using a sample when working with large data files) and examine the
dendrogram and the agglomeration schedule to narrow down the number of clusters to consider.
Alternatively, you can run a series of K-Means cluster analyses varying the number of clusters
and use the criteria discussed earlier to determine the most useful solution. Probably the best
alternative would be to run the TwoStep clustering procedure specifying the variables as
categorical; TwoStep can automatically select the number of clusters and will explicitly treat the
variables as categorical (multinomial).

To expedite this example we will run a single K-Means analysis based on three groups. This
solution was suggested by using Ward’s method (not shown), and we obtained a more satisfying
solution with three clusters than when we ran K-Means using fewer (2) and more (4, 5, 8, 10)
clusters (these results not shown). As an exercise you might apply K-Means clustering to this data
with a different number of clusters.
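Before stepping through the dialogs, here is a rough scikit-learn sketch of the same three-cluster run on the 0/1 usage variables. The file and column names are assumptions, and scikit-learn's default k-means++ starting values stand in for SPSS's search for well-spaced initial centers.

import pandas as pd
from sklearn.cluster import KMeans

usage = pd.read_csv("usage.csv")                          # assumed export of Usage.sav
modules = [c for c in usage.columns if c != "jobarea"]    # the 0/1 module-usage variables

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(usage[modules])

usage["qcl_1"] = km.labels_ + 1                           # saved cluster membership (1-based, like SPSS)
print(usage["qcl_1"].value_counts().sort_index())         # cluster sizes

# Final cluster centers: the proportion using each module within each cluster
centers = pd.DataFrame(km.cluster_centers_, columns=modules, index=[1, 2, 3])
print(centers.round(2))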

Click Analyze…Classify…K-Means Cluster


Move all variables except jobarea into the Variables list box
Enter 3 in the Number of Clusters text box


Figure 6.25 K-Means Cluster Analysis Dialog

Two clusters would have been created by default. Instead, we will look for three.

The Method area permits you to Iterate and classify or to Classify only. We want to iterate and
classify, which actually applies the K-means cluster method to the data. The Classify only choice is
usually used to assign additional cases to clusters already created. The Iterate button controls the
criteria used to determine when the solution is stable and the algorithm will stop.

The Cluster Centers choice of Read initial would be used if you wished to provide starting points
for the clusters, for example based on previous research, or to experiment with “What if?”
scenarios. The centers are saved in SPSS data file format. Use Write final, for example, if you
want to create cluster centers from a smaller sample of data, restore the full dataset, and then use
Read initial to read them in.

As in Hierarchical Cluster, the Save button is used to create a cluster membership variable (only
one since K-means creates a specified number of clusters). The Options button permits you to
print ANOVA tests on each clustering variable; these should not be taken seriously as actual
statistical tests, since the clusters are formed to maximally separate the groups, but rather as
indicators of which cluster variables are most important in the formation of clusters. This is useful when many
variables are used in the analysis and you wish to focus attention on the important ones.

Click the Save button


Click the Cluster Membership check box


Figure 6.26 K-Means Cluster: Save Dialog

We request SPSS to create a cluster membership variable based on the K-means method. We can
also save a variable containing the distance from each observation to the center of its cluster.

Click Continue
Click the Options button
Click the ANOVA table check box in the Statistics area

Figure 6.27 K-Means Cluster Analysis: Options Dialog

We really don’t need the ANOVA table because there are not many usage variables. However, if
we were clustering on 30 variables, then the ANOVA table might suggest a subset of important
variables in the clustering. Unlike the Cluster procedure, the K-means procedure contains an
option to include in the analysis cases that are missing on one or more of the clustering variables.

Click Continue
Click OK

K-Means Results
The Final Cluster Centers pivot table displays means rounded to the nearest whole number (since
the original variables are formatted as integers). We need to see more decimal places, so we must
edit the pivot table.

Double-click on the Final Cluster Centers pivot table (to open the Pivot Table editor)
Click and drag (or click/shift-click) to select all the cell data
Right-click on any of the selected cells
Click Cell Properties
Set the Decimals value to 2
Click OK
Click outside the crosshatched border to close the pivot table editor


Figure 6.28 Final Cluster Centers

The means can be examined to describe the clusters (or we can use a profile plot with a multiple
line chart). Note cluster group 2 has high means on almost all modules; this cluster was named
“Jacks-of-all-Trades.” Can you think of names for the other two clusters? We will discuss them
when viewing the profile plot.

Figure 6.29 F Tests in ANOVA Table

As mentioned earlier, we should not take the significance values seriously here because clustering
is designed to optimally create well-separated groups based on all the input variables. However,
analysts do use the F values to guide them to those variables most important in the clustering
(those with larger F values). Here the Fs range from 40 (professional statistics) to 260 (advanced
statistics) and this provides an indicator as to where our attention should be concentrated, and
which variables most separate the clusters. Stepwise discriminant analysis on the clusters is
sometimes used for the same purpose: to identify the more important cluster variables.
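The idea of using the F values only to rank variables by how well they separate the clusters can be sketched with SciPy as follows; the saved cluster membership variable and column names are assumptions carried over from the previous sketch.

import pandas as pd
from scipy.stats import f_oneway

usage = pd.read_csv("usage_with_clusters.csv")            # assumed file containing qcl_1
modules = [c for c in usage.columns if c not in ("jobarea", "qcl_1")]

f_values = {}
for var in modules:
    groups = [g[var].values for _, g in usage.groupby("qcl_1")]
    f_values[var] = f_oneway(*groups).statistic           # one-way F across the clusters

# Larger F suggests the variable separates the clusters more; the p-values are not taken seriously
for var, f in sorted(f_values.items(), key=lambda kv: -kv[1]):
    print(f"{var:25s} F = {f:.1f}")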


Figure 6.30 Cluster Size

The size of each cluster appears as part of the K-means’ output. Is each of the clusters large
enough to warrant attention? Are there outliers?

Next we’ll create a multiple line chart profiling the three segments. We could do this using the
Graphs menu or create a chart from the Final Cluster Centers table.

Double-click the Final Cluster Centers pivot table (to open the Pivot Table editor)
Click Pivot…Transpose Rows and Columns
Click and drag (or click/shift-click) to select all the cell data
Right-click on any of the selected cells
Click Create Graph…Line from the context menu

We transposed the rows and columns in the Final Cluster Centers table because each row in the
table produces a line in the line chart and we want one line for each cluster.

Figure 6.31 Mean Profiles of K-Means Three Segment Solution

The task is now to identify the cluster groups and decide if they make sense. As mentioned
earlier, cluster group 2 tends to use nearly all the modules, thus their “Jacks-of-all-Trades” label.
Group 1 has high values on the more advanced statistical modules and low scores on tables and
mapping. We named this group the “Technical Analysts.” Group 3 is high on presentation tables
and mapping, but lower on the more statistical modules; we named this group the “Presenters.”

If this were our first K-means run on the data, we would run solutions with varying numbers of
clusters and evaluate the results. After arriving at a tentative solution for this data, the next step
would be to examine these clusters from a business (marketing, sales) perspective and see which
other variables (demographics) they relate to. For example, crosstabulation of the cluster groups
with industry, region, or job area might shed light on how to best reach these segments.

Example III: TwoStep Clustering of Telecom Data


To demonstrate the TwoStep clustering procedure we will cluster data from about 1,500 Telecom
customers. The cluster variables include several usage (international minutes, long distance
minutes, local minutes) and calling plan (local plan, long distance plan) variables. The data file
also contains demographic variables and a variable indicating whether the customer terminated
service. In fact, the original purpose of this study was to build a model predicting which
customers were likely to leave (churn). Because of this the file contains an over-sample of those
who churned and is not a representative sample of all customers. Our goal is to cluster customers
based on variables that measure aspects of phone usage (minutes and plans). An alternative
analysis, which we considered, would be to cluster customers only on their telephone activity, but
we wanted clusters based more broadly on how customers use telecom services.

Click File…Open…Data
Double-click on Telcom.sav

Figure 6.32 Telecom Customer Data File

There are many zero values for international calls (scrolling down through the data), which means
it is unlikely that this variable follows a normal distribution (one of the assumptions of TwoStep).


Click Analyze…Classify…TwoStep Cluster


Move ldplan and locplan into the Categorical Variables list box
Move internat, local, and longdist into the Continuous Variables list box

Figure 6.33 TwoStep Cluster Analysis Dialog

Two categorical and three continuous variables will be used in the cluster analysis. The Distance
Measure area allows either Log-likelihood (default) or Euclidean distance as the distance
measure. The Euclidean option is not active because categorical variables are included in the
analysis.

The Count of Continuous Variables area summarizes how many of the continuous variables will be standardized and how many will not. By default, all continuous variables are standardized so that, when the Euclidean distance measure is selected, variables with greater variance do not have greater influence on the analysis (when the log-likelihood criterion is selected, standardized and non-standardized solutions are identical). Standardization can be turned off for selected variables in the Options dialog. You might do this if all continuous variables share the same scale and have similar variation (as rating scale items might), so the additional calculations required for standardization are unnecessary; if you are using Euclidean distance and want variables with greater variance to have more influence on the solution (as might be appropriate for financial variables); or if you are using the log-likelihood distance measure, for which standardization makes no difference.

In the Number of Clusters area there are options to have TwoStep determine the number of
clusters automatically (the default) or to specify the number of clusters. If the number of clusters
is determined automatically, then the maximum number of clusters to be considered by the
algorithm must be set, which controls the maximum number of clusters that will be candidates for
the best solution in the second step (clustering); it does not control the number of pre-clusters
formed in the first step (pre-clustering). For example, if you are only interested in a small number
of clusters, then the maximum value could be reduced; alternatively, if you wish to explore
solutions with a large number of clusters, the maximum value could be increased.

There are two criteria available when the algorithm chooses the optimal number of clusters: the
Schwarz Bayesian criterion (default) and the Akaike Information criterion. Both are commonly
used to evaluate model goodness-of-fit. They both involve –2*log-likelihood plus a penalty based
on complexity—here complexity corresponds to the number of clusters—but differ in the form of
the penalty function. Because of their form, typically they are not interpreted directly, but are
used to compare alternative models (smaller values are better).
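In their most familiar textbook forms (a sketch; the TwoStep implementation may differ in how it counts parameters),

    BIC = -2*log-likelihood + m*ln(N)
    AIC = -2*log-likelihood + 2*m

where m is the number of estimated parameters (which grows with the number of clusters) and N is the number of cases. Because the BIC penalty grows with the sample size, BIC tends to favor solutions with fewer clusters than AIC does.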

Click the Options button

Figure 6.34 TwoStep Cluster: Options Dialog

The items above the Advanced>> button are standard options; clicking this button extends the dialog so that the advanced options appear below it. If Outlier Treatment is selected and the CF tree fills up during the pre-clustering step, outlier pre-clusters (those containing fewer than a specified percentage, 25% by default, of the number of cases in the largest pre-cluster) are placed into a single pre-cluster. This tends to reduce the occurrence of small clusters due to outlier data points.

The Standardization of Continuous Variables area indicates which continuous variables (all by
default) will be standardized prior to analysis, and allows you to change the designation. If
continuous variables are not standardized, then those with greater variance and a larger scale have
more influence in the analysis when the Euclidean distance measure is used.

The Advanced options (not shown) allow you to change criteria that influence the CF tree
building in the pre-clustering step and to import a CF tree saved in a previous analysis, which can
be used as a starting point when clustering new data (the imported CF tree is updated with the
new data in the pre-cluster step, then the cluster step is performed).

Click Cancel
Click the Plots button


Click the Within cluster percent chart check box


Click the Rank of variable importance check box (By cluster is the default)
Click Significance in the Importance Measure area
Click the Confidence level check box

Figure 6.35 TwoStep Cluster: Plots Dialog

The Plots dialog contains several display options that help describe the clusters and provide
insight into which variables were important in cluster formation.

The Within cluster percentage chart option will produce, for each categorical variable, a clustered
bar chart showing the distribution of the variable within each cluster. For each continuous
variable, an error bar chart will present the mean and confidence bands for the variable within
each cluster. These plots can be helpful in describing, and differentiating among, clusters in terms
of the cluster variables.

The Cluster pie chart option creates a pie chart showing the number of cases in each cluster. This
information is also available in a summary table.

The Variable Importance Plots attempt to provide insight into which cluster variables are
important in the formation of a cluster. This can be especially useful when a large number of
variables are used in the analysis. Two orientations are available: By cluster shows all clusters for
each variable, while By variable shows all variables of the same type (continuous, categorical) for
each cluster. The first is better for discovering for which cluster(s) a target variable is important,
while the second orientation makes it easier to see which variables are more important in defining
a target cluster. One option allows the importance measures used in these plots to be the chi-
square (for categorical variables) or t-test (for continuous variables) values. The actual test on
which these statistics are based compares the target cluster to all clusters. For those who often
work with chi-square or t-test results, these summaries may be useful. An alternative importance
measure, which has the advantage of placing both types of variables (categorical and continuous)
on the same scale, albeit not an intuitive one, is based on statistical significance values (–log10 of
statistical significance— the p value—is used) of the tests just described. This transformation

Cluster Analysis 6 - 32
Advanced Statistical Analysis Using SPSS

(–log10 of statistical significance) was chosen since it stretches the original scale to 0 to infinity
(instead of from 0 to 1). Also, note that Bonferroni adjustments are applied to these tests to
control the false-positive (apparent differences in the sample that are not present in the
population) error rate. The Confidence Level for these tests can be set (default is 95%) and, if the
option is selected, this value will appear as a reference line in the importance plots. In addition,
non-significant variables can be dropped from these plots.
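For example, ignoring the Bonferroni adjustment, a test with significance p = .05 corresponds to –log10(.05), or about 1.3, on the importance scale, while p = .001 corresponds to 3; more significant tests therefore produce longer bars.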

We request the within-cluster percentage chart to provide a graphical means of interpreting the
clusters. In addition, we request that the importance plots use statistical significance as the
importance measure. Note that when charts are requested, TwoStep automatically creates a
cluster membership variable in the data file, so we won’t need to do this ourselves in the Output
subdialog box.

Click Continue
Click the Output button
Click Information criterion (AIC or BIC) in the Statistics area
Click Create cluster membership variable in the Working Data File area

Figure 6.36 TwoStep Cluster Output Dialog

By default, descriptive statistics on the cluster variables will display for each cluster, as will the
cluster frequencies— both are useful summaries. In addition, we requested that the information
criterion appear in the results; these are used when TwoStep automatically selects the number of
clusters and can provide insight into why it chose a particular number of clusters.

We asked that a cluster membership variable, recording the cluster to which each case is
assigned, be added to the active data file. Options are also available to export the final model and
the CF tree (which can be used as a starting point when clustering additional data) in XML
format.

Click Continue
Click OK
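Pasting rather than clicking OK would generate TWOSTEP CLUSTER syntax. The sketch below reflects the choices made above but is written from memory and omits the chart requests made in the Plots dialog, so treat the exact subcommand spellings as assumptions to be verified against pasted syntax.

    TWOSTEP CLUSTER
      /CATEGORICAL VARIABLES=ldplan locplan
      /CONTINUOUS VARIABLES=internat local longdist
      /DISTANCE LIKELIHOOD
      /NUMCLUSTERS AUTO 15 BIC
      /PRINT IC COUNT SUMMARY
      /SAVE VARIABLE.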


Figure 6.37 Cluster Distribution Table

The cluster distribution table presents the size of each cluster on both count and percentage bases.
The TwoStep auto-clustering algorithm selected a five-cluster solution. Cluster size ranges, on a
percentage basis, from 12.9% to 29% of the sample, so there are no small clusters containing
outliers (if there were, the Outlier Treatment option could be applied).

Figure 6.38 Auto-Clustering Criterion Table

When the option to automatically select the number of clusters is chosen and we request statistics
on the information criterion, the Auto-Clustering table displays summaries for a range of cluster
solutions: from a single cluster to the number of clusters specified in the Maximum box of the
TwoStep dialog (in this case, the default of 15). Recall that these represent agglomerative
hierarchical clustering of the pre-clusters formed in the first (pre-cluster) step. The Schwarz Bayesian criterion (BIC), a goodness-of-fit measure, is listed for each solution. Here it decreases (smaller values indicate improved fit) as the number of clusters increases, up through the 10-cluster solution, and then increases with additional clusters.

As the legend indicates, the BIC Change column records the change in BIC from the previous
cluster step. Thus the BIC change for the 2-cluster solution (row 2) represents the difference in
BIC between the 1-cluster and 2-cluster solutions. Large negative changes in BIC indicate that
increasing the number of clusters by one substantially improves the fit. To frame this as a relative,
rather than absolute change, the Ratio of BIC Changes column presents the ratio of the BIC
change to the change when moving from the 1-cluster to 2-cluster solution. Thus, the Ratio of
BIC Changes value for the 4-cluster solution is .475, calculated by dividing its BIC change value
(–975.383) by the BIC change from the 1-cluster to the 2-cluster solution (–2052.953). A large
jump in these ratio values (moving up the table) suggests a promising solution, since adding a
cluster at that point produces a relatively large change in fit. When the ratio value becomes small,
it indicates that the additional cluster results in relatively little improvement, signaling that
clustering has gone too far. When this occurs, the algorithm examines solutions with fewer
clusters using the second criterion, Ratio of Distance Measures.

The Ratio of Distance Measures column contains, for each cluster solution, the ratio of the
distance measure (by default, the change in log-likelihood when two clusters combine) for that
number of clusters divided by the distance measure from the previous solution. Thus the Ratio of
Distance Measures for 2 clusters (row 2) represents the ratio of the change in log-likelihood
between a 1- and 2-cluster solution, divided by the change in log-likelihood between a 2- and 3-
cluster solution. Note that the actual distance measures (change in log-likelihood) do not appear
in the table. Again, a large jump in these ratio values suggests a promising solution, since adding
a cluster at that point produces a relatively large change in the distance measure (change in log-
likelihood).

When TwoStep is set to automatically determine the number of clusters, the two ratio summaries
(Ratio of BIC/AIC Changes, Ratio of Distance Measures) are scanned for a cluster solution. First
the point at which the Ratio of BIC/AIC Changes becomes very small indicates that the number of
clusters is too great, next the Ratio of Distance Measures is examined for fewer clusters,
searching for a jump. Here a five-cluster solution was chosen. In the table, notice the change in
ratio values moving from the 6-cluster solution to the 5-cluster solution. The Ratio of Distance
Measure more than doubles (1.155 to 2.379); this led to the 5-cluster solution being selected from
among its near competitors. In addition, at the same point the Ratio of BIC Changes almost triples
(.154 to .411); both measures support the solution. We examined this table in some detail in order
to illustrate the logic of the clustering step, but typically you would not view it unless you wanted
to review the inner workings of automatic cluster selection.

Next we examine the cluster profiles.

Figure 6.39 Cluster Profiles on Continuous Cluster Variables


Figure 6.40 Cluster Profiles on Categorical Variables

We will now try to interpret the clusters in terms of the clustering variables. If many variables
were involved, we might first examine the attribute importance summary to identify which
variables are most important in defining individual clusters.

Cluster 1 is composed of individuals with high levels of phone usage in all categories. This group
shows a mix of long distance and local plans.

Clusters 3 through 5 have similar values on the phone usage variables and differ mainly in the plans in which they participate. Cluster 3 has close to average usage with slightly higher
international call usage than cluster 4 or 5, and contains those with discount long distance and
standard local plans. This is the largest cluster with almost 30% of the customers. All members of
cluster 3 have the discount long distance plan.

Clusters 4 and 5 both show very similar usage patterns, primarily differing in choice of plans.
Everyone in cluster 5 is on the discount local plan while those in cluster 4 are all on the standard
local plan. Also, all members of cluster 4 are on the standard long distance plan, while cluster 5
members are split between the discount and standard plans. Since the main point of
differentiation seems to be the type of local plan, it might be worth investigating why groups with
such similar local calling behavior should be on two different plans. Of course, analyzing total
minutes masks such factors as time of day calls were made, etc., but still, this serves to suggest
the value of clustering.

Cluster 2’s members make no international or long distance calls and have no long distance plan.
This would be the group whose long distance service is provided by another company, which is a
customer group that might be worth looking at more closely.

This information is also presented graphically in the Within Cluster plots. We will examine one
for each variable type (continuous, categorical), although if you were using these plots as the
main means of interpreting the clusters, you would examine them all (or at least the plots for all
influential variables).


Scroll down to the Within Cluster plot for Long Distance Plan

Figure 6.41 Within Cluster Plot for Long Distance Plan

In this bar chart, the five clusters, plus an overall group summary, are positioned along the
vertical axis. For each cluster, the bars show the percentage of customers under each long
distance plan. It is clear that cluster 2 contains only customers with no long distance plan, cluster
4 only customers with the standard plan, and cluster 3 only customers with the discount plan.
These percentages appeared in the cluster profiles table.

Scroll down to the Within Cluster plot for Longdist

Continuous cluster variables are presented in an error bar chart. The five clusters are placed along
the horizontal axis and the overall mean appears as a reference line. The bars represent the 95%
confidence bands, which here are fairly narrow. It is clear that people in cluster 1 spend
considerably more time on long distance calls than those in other clusters. Cluster 2 is at 0 and
only has a single bar because there was no within-cluster variation (all members had no long
distance call time). The mean values (boxes) also appear in the cluster profiles centroids table,
examined earlier. In practice, the remaining within-cluster plots would be viewed to better
interpret the clusters (the process we went through with the cluster profiles tables).


Figure 6.42 Within Cluster Plot for Long Distance Call Usage

(Error bar chart: simultaneous 95% confidence intervals for mean longdist within clusters 1 through 5; the reference line is the overall mean of 27.)

Scroll down to the Clusterwise Importance plot for Long Distance Plan

Figure 6.43 Importance Plot for Long Distance Plan

We requested that variable importance be ranked by cluster, so a separate plot for each variable
presents the relative importance of the variable in defining individual clusters. This is
accomplished by performing a series of chi-square tests (for a categorical cluster variable) or t-
tests (for a continuous cluster variable) in which each cluster group is tested against the overall
group. Since multiple tests are performed (one for each cluster), Bonferroni adjustments are
applied to control the false-positive error rate. The Tolerance line in the plot represents the critical
value for statistical significance (here the Bonferroni adjusted .05 level, since 95 was selected as
the Confidence level). If the bar for a cluster extends beyond the tolerance line, then that cluster
significantly differs from the overall group.

Since Significance was chosen as the importance measure (in the TwoStep Plots dialog), the
horizontal axis represents the statistical significance of the test. Because significance values fall
between 0 and 1, and interest is in values below .05 (a small band), a negative log transformation
is applied to the significance scale. This accomplishes two things: the 0 to 1 scale is stretched to a 0 to infinity scale, and larger values indicate greater significance.

Viewing the plot, we see that the long distance plan variable was most important in distinguishing
cluster 2 from the overall group (cluster 2 contained all customers with no long distance plan).
Each of the other groups also differed significantly from the overall group in the distribution of
long distance plans, but this variable is much more important for cluster 3 than for, say, cluster 5.

In practice, each of these plots could be examined to obtain a complete picture of which variables
were most important in the definition of which clusters. However, here we limit ourselves to
viewing one importance plot for a continuous cluster variable (Longdist).

Scroll down to the Clusterwise Importance plot for longdist

Figure 6.44 Importance Plot for Long Distance Minutes

For a continuous cluster variable, t-tests are performed, comparing each cluster group mean to the
overall group mean. We see that long distance minutes were important in distinguishing cluster 1
from the overall group. Recall that cluster 1 had high usage values in general. Also, long distance
minutes were relatively unimportant in differentiating clusters 3, 4, and 5 from the overall group
(although the differences from the overall mean are statistically significant since the bars are
above the tolerance line).


This plot does not show the direction of the mean difference, but we could return to the cluster
profile tables or within-cluster plots for this information. Alternatively, if we had chosen the chi-square or t-test statistic (rather than significance) as the importance measure, we could tell whether a cluster's mean was above or below the overall mean from the sign of the t statistic presented in the plot.
Also, notice that a test is not performed for cluster 2 because it had no variance (all values were
0) on long distance minutes. While this is an unusual circumstance, it is important to note that
long distance minutes is an important variable in defining cluster 2, although this can’t be
determined by performing a standard t-test.

In summary, the importance plots provide insight into which cluster variables are influential in
the formation of which clusters. This is especially helpful when many variables are used in the
clustering.

Click the Goto Data tool


Scroll to the right

Figure 6.45 Cluster Membership Variable in Data Editor

The new cluster variable, here named TSC_3275 (number part of variable name is randomly
generated), has been added to the data file. Value labels can be added to describe the clusters and
the new variable can be used in further analysis; for example, we could see how it relates to
customer demographics.


Chapter 7
Factor Analysis

Topics
• Uses of Factor Analysis
• What to Look for When Running Factor Analysis
• Principles
• The Idea of a Principal Component
• Factor Analysis Versus Principal Components
• Determining the Number of Factors
• Rotation
• Factor Scores & Sample Size
• Methods
• Overall Recommendations
• An Example: 1988 Olympic Decathlon Scores
• Principal Components with Orthogonal Rotation
• Principal Axis Factoring with an Oblique Rotation

Introduction
Factor analysis performs data reduction, simplifying a set of measured variables by identifying a smaller number of underlying traits, attitudes, or beliefs that influence those
variables. The goal of factor analysis can be either exploratory or confirmatory. The focus of this
chapter will be on exploratory factor analysis (hereafter factor analysis). Confirmatory factor
analysis takes things a step farther, attempting to use factoring methods to test and confirm
hypotheses (for example Amos can be used to test specific models about how underlying factors
relate to measured variables). In this chapter we will first consider some applications of factoring
and then go on to consider what to look for when running factor. Some background principles of
factor analysis will be covered along with comments about popular factor methods, and overall
recommendations will be made. To illustrate factoring, we will factor analyze the performances
on the events in the 1988 Olympic Decathlon.

Uses of Factor Analysis


Factor analysis is a statistical technique whose main goal is data reduction. A typical use of factor
analysis is in survey research, where a researcher wishes to represent a number of questions with
a small number of hypothetical factors. For example, as part of a national survey on political
opinions, participants may answer three separate questions regarding environmental policy,
reflecting issues at the local, state and national level. Each question, by itself, would be an
inadequate measure of attitude towards environmental policy, but together they may provide a
better measure of the attitude. Factor analysis can be used to establish whether the three measures
do, in fact, measure various aspects of the same concept. If so, they can then be combined to
create a new variable, a factor score variable that contains a score for each respondent on the
factor. Factor techniques are applicable to a variety of situations. In our example in this chapter, a
researcher may want to know if the skills required to be a decathlete are as varied as the ten
events, or if a small number of core skills are needed to be successful in a decathlon. You need
not believe that factors actually exist in order to perform a factor analysis, but in practice the
factors are usually interpreted, given names, and spoken of as real things.

What to Look for When Running Factor Analysis


There are two main questions that arise when running factor analysis: how many (if any) factors
are there, and what do they represent? These questions are related because in practice you rarely
retain factors that you cannot identify and name. While naming factors may not stump a creative
researcher for long, developing useful labels for factors is necessary. That is, interpretability is an
important criterion when deciding to keep or drop a factor. Two useful technical aides,
eigenvalues and percentage of variance accounted for, will be discussed because of their utility in
choosing the number of factors. However, these technical measures are only a guide and they
provide no absolute criteria when factoring. In addition, factor loadings or lambda coefficients
will be explained. They provide information as to which factors are highly related to which
variables and thus give insight into what the factors represent. Finally, rotation of the factors is
done to facilitate interpretation. The two general classes of rotation, orthogonal and oblique, will
be reviewed.

Principles
The theoretical basis for factor analysis is that variables are correlated because they share one or
more common components. That is, correlations among variables are explained by underlying
factors. Mathematically a one-factor model for three variables can be represented as follows (Vs
are variables, Fs are factors, Es represent random error).

V1 = L1*F1 + E1
V2 = L2*F1 + E2
V3 = L3*F1 + E3

Each variable is composed of the common factor (F1) multiplied by a loading coefficient (L1, L2,
and L3—the lambdas) plus a random component (Ei). If the factor were directly measurable
(which it isn’t) this would amount to a simple regression equation. Since these equations cannot
be solved as given (the Ls, Fs and Es are unknowns), factor analysis takes an indirect approach. If
the equations above hold, then consider why variables V1 and V2 correlate. Each contains an error
(the Ei, assumed to be random or unique) component that cannot contribute to their correlation
(errors are assumed to have 0 correlation). However, they share the factor F1, so if they correlate
it should be related to L1 and L2, (the factor loadings). If this logic is applied to all the pairwise
correlations, the loading coefficients can be estimated from the correlation data. Thus one factor
might account for the correlations in a set of variables. If not, the equations can be easily
generalized to accommodate additional factors. There are different approaches to fitting factors to
a correlation matrix (least squares, generalized least squares, maximum likelihood, etc.), which
have given rise to a number of factor methods.
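To make the generalization concrete, a two-factor model for the same three variables simply adds a second factor and a second set of loadings:

    V1 = L11*F1 + L12*F2 + E1
    V2 = L21*F1 + L22*F2 + E2
    V3 = L31*F1 + L32*F2 + E3

With uncorrelated factors, the model implies, for example, that the correlation between V1 and V2 is reproduced as L11*L21 + L12*L22, so the loadings can again be estimated from the observed correlations.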


There is an important caveat regarding factor analysis. While the same correlations can always be
derived from a known set of factor loadings, the reverse is not true. Different factor solutions can
be consistent with a single set of correlations (e.g., the same correlations can be produced by
several sets of factor loadings, or by a varying number of factors). This makes determining the
underlying structure (the real factors) difficult. The solution involves adopting two working principles. First, since factor analysis is based on correlations, only linear relations are assumed to be present; that is, a basic assumption of factor analysis is that the variables are linear combinations of some underlying factors. This rules out nonlinear solutions and simplifies the problem. Second, when two (or more) otherwise equivalent solutions remain, the researcher picks the more parsimonious one. Also, the researcher should try to confirm
the validity of the solution. Typically the first steps in confirming a factor solution are to validate
it with a different factor method and on a new sample. In addition, one might conduct additional
tests to evaluate factors. The general form of these additional tests would be “If this is really a
factor then the following should be true.” Finally, you may want to use confirmatory factor
analysis to evaluate a specific causal structure (as you become more familiar with the technique).

The Idea of a Principal Component


A concept related to most methods of factoring is the idea of a principal component. A principal
component is a linear combination of observed variables that is independent (orthogonal) of other
components. The first principal component accounts for the largest amount of variance in the
input data. The second component accounts for the largest amount of the remaining variance in
the data, and so on. An important implication of components being orthogonal is that they are
uncorrelated. Assuming that it is appropriate for the data, having factors that are uncorrelated
facilitates interpretation. In addition, orthogonal components help to resolve issues of
multicollinearity. For example, in survey research (e.g., customer satisfaction) it is common to
have many questions addressing a specific issue (e.g., customer service). It is likely that these
questions will be highly correlated. It is problematic for most statistical procedures to employ
highly related variables together (such as for predictors in regression). One solution is to use
factor scores, computed from factor loadings on each orthogonal component, in place of the set of
highly correlated variables.

Factor Analysis versus Principal Components


Differences among factoring methods are due to how each solves for components. For example,
the diagram below is a correlation matrix composed of five variables. Principal components
attempts to account for the maximum amount of variance in the set of variables. Since the
diagonal of a correlation matrix (the ones) represents standardized variances, each principal
component can be thought of as accounting for as much of the variation remaining in the diagonal
as possible. Principal axis factoring, as one of a number of factor methods, attempts to account
for correlations between the variables, which in turn accounts for some of their variance.
Therefore, factor focuses more on the off-diagonal elements in the correlation matrix. So while
both methods attempt to fit a correlation matrix with fewer components or factors than variables,
the computational focus of each is different. Of course, if principal components accounts for
variance in variables V1 and V2 it must also be accounting for their correlation. Likewise, if a
factor method accounts for the correlation between V1 and V2, it must account for at least some of
their variance. Thus, there is overlap in the methods and they usually yield similar results (as we
will see). Often principal axis factoring is used when there is interest in studying relations among
the variables, while principal components is used when there is a greater emphasis on data
reduction and less on interpretation.

This said, there is a technical distinction in that a principal component can be explicitly defined as
a unique combination of variables, while a factor cannot be. Because of this, some statisticians
treat principal components and factor analysis as completely disparate methods. Since we are
taking a pragmatic approach, we present them as related techniques often used to solve the same
problem.

Figure 7.1 Correlation Matrix

Number of Factors
Several technical measures are available to guide you in choosing a tentative number of factors or
components. Eigenvalues are the most commonly used index for determining how many factors
to use from a factor analysis. They are fairly technical measures, but when principal components
are derived, their values represent the amount of variance in the variables that is accounted for by
a component (or factor). Referring back to the correlation matrix in Figure 7.1, there are five
variables in this matrix and therefore 5 units of standardized variance to be accounted for in the
solution (the sum of the values in the diagonal).

Since an eigenvalue for a component is an index of the amount of this variance it accounts for, it
is also an index of whether it is likely to be useful. That is, eigenvalues logically lead to a rule of
thumb for determining the number of factors to take from a factor analysis: take as many factors
as there are eigenvalues greater than 1. Why? Well, if an eigenvalue represents the amount of
standardized variance in the variables accounted for by a factor, then it should represent at least
as much variance as is contained in a single variable (1 unit). Thus an eigenvalue greater than 1
must account for variation in more than one variable. Now an eigenvalue can be less than 1 and
still account for variation shared among several variables (for example 1/3 of the variation of
each of three variables for an eigenvalue of .99). So the eigenvalue of 1 rule is only applied as a
heuristic and is not the final word. Another aspect of eigenvalues to note (for principal
components and most factor methods under orthogonal rotation) is that their sum, divided by the number of variables, gives the percentage of variance accounted for, which is helpful when evaluating the overall
value of a solution. The proportion of variance in a variable explained by the factors is called its
communality. Finally, it is important to reiterate that an overriding concern is that a factor must
make sense. For this reason, analysts might drop factors with eigenvalues over 1 that cannot be
interpreted, or retain interpretable factors with eigenvalues below 1.
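As a small worked example of this variance bookkeeping: with p standardized variables there are p units of variance, so the proportion of variance accounted for by a component equals its eigenvalue divided by p. An eigenvalue of 5 among 10 variables therefore corresponds to 50% of the total variance, while an eigenvalue of exactly 1 accounts for no more variance than a single variable contributes.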


Rotation
Recall that a fundamental problem in factor analysis (but not in principal components analysis) is
the possibility that several different factor solutions can account for the same set of correlations.
This problem can be illustrated in a two-dimensional space. Figure 7.2 depicts a two-factor
solution based on six variables with the axes representing the factors. Variables are located
(plotted) in this space using factor loadings derived from a specific factor solution. However, the
orientation of the factors in the two-dimensional solution is indeterminate. This means that the
location of the variables in relation to the factor axes is likewise arbitrary. Simply stated, no
orientation of the axes (i.e., the factors) is any more correct than another. This led researchers to
develop the concept of a “simple solution,” an axis orientation in which factors would be strongly
related to some variables, but weakly related to others. Rotation, in this context, refers to moving
the axes to obtain this result. The ideal result of rotation is that each variable will have a high
loading on a single factor (have a lambda coefficient near one) and small loadings (near zero) on
the other factors. Therefore, the net effect of rotation, as well as its main motivation, is to
facilitate interpretation.

Figure 7.2 Two Factors Based on Six Variables

If the rotation is done in a way so that the axes remain perpendicular, then the factors are
orthogonal (or uncorrelated, see F1' and F2'). Alternatively, if you allow the relationship between
the axes to be other than perpendicular (i.e. oblique; see F1" and F2"), then variables may load
differently on factors and an even clearer interpretation may emerge. However, oblique rotation,
by definition, means that the factors will be correlated. Orthogonal rotations are typically done
when data reduction is the objective and there is a desire to create uncorrelated factors. Oblique
rotations are used if there are reasons (usually theoretical) to allow factors to correlate. For
example, a researcher might investigate whether two factors measuring mathematical and verbal
ability correlate because past research makes this likely, or because a theory makes this
prediction.

For each class of rotations there are different variations. SPSS provides three orthogonal
rotations. The most popular is the varimax rotation, which attempts to simplify interpretation by
maximizing the variance of the squared loadings on each factor (i.e., it tries to simplify the
factors). The quartimax rotation attempts to simplify the solution by finding a rotation that
produces high and low loadings across factors for each variable (i.e., it tries to simplify the
descriptions of the variables). Equimax rotation is a compromise between varimax and quartimax.


SPSS provides two oblique rotations, Oblimin and Promax. Promax runs faster on larger data sets
(many variables).

Factor Scores and Sample Size


If you are satisfied with a factor solution, you can request that a new set of variables be created
that represents the scores of each observation on the factors. These are calculated by multiplying
each of the original variables (in standardized z score form) by a weight coefficient (derived from
the lambda coefficients) and summing these products. These factor variables can then be used as
the input variables for additional analysis. They are usually normed to have a mean of zero and
standard deviation of one. An alternative some analysts prefer is to use the lambda coefficients to
judge which variables are highly related to a factor, and then compute a new variable which is the
sum or mean of that set of variables. This method, while not optimal in a technical sense, keeps
(if means are used) the new scores on the same scale as the original variables (assuming the
variables themselves share the same scale). This can make the interpretation, and importantly, the
presentation, more straightforward. Essentially, subscale scores are created based on the factor
results, and these scores are used in later analyses.
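In symbols, the saved factor score for a case on factor k has the general form (a sketch; the weights w are derived from the loadings by the chosen estimation method)

    Fk = w1k*Z1 + w2k*Z2 + ... + wpk*Zp

where the Zs are the variables in standardized form. The simpler subscale alternative replaces the weights with 0/1 choices, summing or averaging the raw variables that load highly on the factor.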

Since factor analysis is a multivariate statistical method, the rule of thumb for sample size is that
there should be from 10 to 25 times as many observations as there are variables used in the
analysis. This is because factor is based on correlations, and for p variables there are p*(p-1)/2
pairwise correlations. Think of this as a desirable goal and not a formal requirement (technically, with p variables only p+1 observations are needed for factor to run, but do not expect reasonable results). In practice, factoring is often done even when the ratio of observations to variables is below 10. Remember that every factor solution, regardless of the sample size, needs to be
evaluated further in terms of validation and confirmation.

Methods
Counting principal components as a separate method, there are seven factoring techniques
available in SPSS. Principal axis has already been described. Unweighted least-squares produces
a factor solution that minimizes the residual between the observed and the reproduced correlation
matrix. Generalized least-squares does the same thing, only it gives more weight to variables
with stronger correlations. Maximum-likelihood generates the solution that is the most likely to
have produced the correlation matrix if the variables follow a multivariate normal distribution.
Alpha factoring considers variables in the analysis, rather than the cases, to be sampled from a
universe of all possible variables. As a result, eigenvalues and communalities are not derived
from factor loadings. Finally, image factoring decomposes each observed variable into a common
part (partial image) and a unique part (anti-image) and then operates with the common part. The
common part of a variable can be predicted from a linear combination of the remaining variables
(via regression), while the unique part cannot be predicted (the residual).


Overall Recommendations
Principal components and principal axis factoring are the most commonly used methods.
Principal components will run with near or even multicollinear data, which gives it an advantage
over most factor methods. Factor methods other than principal axis and maximum likelihood are
not commonly used. Rotation of the factor matrix is almost always done, and Varimax is the most
common rotation, to keep the factors orthogonal.

An Example: 1988 Olympic Decathlon Performances


We will apply factor analysis to scores for 34 athletes participating in the 1988 Olympic
Decathlon. The question we investigate is whether a small number of athletic skills account for
performance in the ten separate decathlon events. The data are contained in an SPSS data file
called Olymp88.sav. We will analyze the data with both principal components and principal axis
factoring methods. We will also use both orthogonal and oblique rotations. Finally, we will save
factor scores as variables.

Click File…Open…Data
Move to the c:\Train\Advstat directory (if necessary)
Double-click on Olymp88

As a first step, we will take a look at a listing of the variable labels and generate correlations.

Click the Variable View tab in the Data Editor window

Figure 7.3 Variable Information Displayed in the Variable View Sheet

The labels remind us that there are four track (100 meters, 400 meters, 110 meter hurdles, 1500
meters) and six field events (long jump, shot put, high jump, discus, pole vault, and javelin). Also
recall that the event takes place over a two-day period and that victory is determined by points
awarded to the athletes on the basis of their performance in each event.


Looking at Correlations
We will run correlations to get a sense of the relationships between these variables. By default,
Pearson product moment correlation coefficients display. For rank ordered data, SPSS can
produce Spearman rank correlations and Kendall's tau-b coefficients (but these will not
be used by factor, in any case).

Click Analyze…Correlate…Bivariate
Move all variables except score into the Variable(s) list box

Figure 7.4 Completed Correlation Dialog Box

Click OK
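The equivalent syntax is a single CORRELATIONS command. The decathlon variable names shown here are hypothetical placeholders, since the actual names in Olymp88.sav may differ.

    CORRELATIONS
      /VARIABLES=run100 longjump shotput highjump run400 hurdles discus polevault javelin run1500
      /PRINT=TWOTAIL NOSIG
      /MISSING=PAIRWISE.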


Figure 7.5 Correlations of Decathlon Scores

Note that the correlation table was edited so the inner column labels are rotated and the sample
sizes hidden.

We see that there are moderate to strong relationships among discus, javelin, pole vault, and shot put. However, there is no relationship between discus and either the 400- or 1500-meter run. The
hurdles event is negatively correlated with all of the field events, but positively correlated with
the 100 and 400 meters, and it has no relationship to the 1500 meters. A pattern quickly emerges
as we examine the correlations. For the most part, track events correlate positively with each
other, as do field events, while track and field events correlate negatively. Also, recall that higher
values (distances) for field events and lower values (times) for track events indicate superior
performance. It may be that the decathlon largely taps into a small number of underlying athletic
traits. Possibly speed and strength?

Principal Components with Orthogonal Rotation


First we will run principal components analysis with a varimax rotation to aid interpretation. We
approach this as an exploratory analysis by requesting a diagnostic plot (scree plot) and
descriptive statistics.

Click Analyze…Data Reduction…Factor


Move all variables except score into the Variable(s) list box


Figure 7.6 Factor Dialog

Click the Extraction button


Click the Scree Plot check box in Display area

Figure 7.7 Factor Analysis: Extraction Dialog

The default method is principal components. The Extract area indicates that SPSS will select as
many factors (here components) as there are eigenvalues over 1 (we discussed this rule of thumb
earlier). Notice you can change this rule or specify a number of factors; you might do this if you
prefer more or fewer factors than the eigenvalue rule provides. We ask for a scree plot, which
some analysts use to guide the selection of the number of factors. By default, the unrotated
solution will appear. Since we will request a rotated solution, this box could be checked off. We
leave it as is so that we can briefly compare the unrotated to the rotated solution. Next we review
the Rotation dialog box.

Click Continue
Click the Rotation button
Click the Varimax option button
Click the Loading plot(s) check box


Figure 7.8 Factor Analysis: Rotation Dialog

We have selected varimax, the most popular rotation, to ease the task of interpreting the factors.
By default, the loadings are displayed in a table. The loadings plot is a scatterplot placing each
variable in the factor space based on its factor loadings. Some analysts prefer to view the plot
instead of the table (matrix) of loadings when the number of factors is small (2 or 3). In a
situation where a solution contains more than three factors, only the first three factors will appear.
There are options available in syntax for plotting a larger number of factors. Next we pick some
formatting options that organize and simplify the factor-loading matrix.

Click Continue
Click the Options button
Click the Sorted by size check box in the Coefficient Display Format area
Click Suppress absolute values less than: check box and replace .1 with .3 in the text
box

Figure 7.9 Factor Analysis: Options Dialog

The factor procedure has several missing value options, but since the data for each athlete is
complete we can leave the listwise default in place. Pairwise deletion is often chosen when some
variables have missing values for only a small number of cases (although there can be
computational difficulties when using pairwise, so it is usually not recommended). The Sorted by
size option will have SPSS display the variables in descending order by absolute value of their
loading coefficients. The loadings will also be grouped by the factor on which they load highest.
This improves the readability of the loading matrix. To further aid this effort, we suppress loading
coefficients less than .3 in absolute value. Thus we will only see larger loadings (small values are
replaced with blanks) and not be distracted by small ones. These options are not required; they
are used only to make the interpretive task easier. Finally, we request some summary statistics
from the Descriptives subdialog box.

Click Continue
Click the Descriptives button
Click the Univariate descriptives check box
Click the KMO and Bartlett’s test of sphericity check box

Figure 7.10 Factor Analysis: Descriptives Dialog

The Descriptives dialog can be used to request correlations and means (Univariate descriptives).
In addition, we can turn off the display of the initial solution consisting of as many components or
factors as variables, if we do not want to compare it to the final solution (we do). Several matrices
can be requested for those who want to evaluate the suitability of the data for factor analysis. The
Kaiser-Meyer-Olkin (KMO) measure is an indicator of how well suited the sample data are for
factor analysis. It is the ratio of the sum of the squared correlations for all pairs of variables in the analysis to that same sum plus the sum of the squared partial correlations for all pairs of variables. The denominator of this ratio increases with variation that is unique to pairs of variables (partial correlations); consequently, the value of KMO varies from 0 to 1. Small
values of KMO indicate that factor analysis may not be appropriate for the data. Kaiser (1974)
suggests that values of .9 or higher are great and values below .5 are unacceptable. Bartlett’s test
of sphericity evaluates the null hypothesis that the correlation matrix is an identity matrix (all the
values in the diagonal are 1 and all the off-diagonal values (correlations) are zero), which would
indicate no relationships among the variables, and thus no basis on which to proceed with factor
analysis. A significance test result allows us to reject this hypothesis. Of the two, most analysts
prefer the KMO statistic.
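In symbols, summing over all pairs of distinct variables i and j,

    KMO = sum(r(i,j)^2) / [ sum(r(i,j)^2) + sum(p(i,j)^2) ]

where r(i,j) is the simple correlation and p(i,j) is the partial correlation between variables i and j controlling for the remaining variables. When the partial correlations are near zero, KMO approaches 1.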

Click Continue, and then click OK
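For reference, pasting these selections would produce FACTOR syntax roughly like the following (again using hypothetical placeholder names for the ten event variables):

    FACTOR
      /VARIABLES run100 longjump shotput highjump run400 hurdles discus polevault javelin run1500
      /MISSING LISTWISE
      /PRINT UNIVARIATE INITIAL KMO EXTRACTION ROTATION
      /FORMAT SORT BLANK(.3)
      /PLOT EIGEN ROTATION
      /CRITERIA MINEIGEN(1) ITERATE(25)
      /EXTRACTION PC
      /ROTATION VARIMAX.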

Each section of output in the Factor Analysis section, starting with descriptive statistics, will be
described in turn.


Figure 7.11 Descriptive Statistics

The descriptive statistics reported are means, standard deviations, and sample sizes. It’s hard to
believe that these are mean values (recall they are Olympic-class athletes)! Moreover, the
standard deviations are tiny. With ten variables but only a little more than 30 observations, the
ratio of variables to cases is less than desirable. We will have to be cautious with our results.

Figure 7.12 KMO and Bartlett’s Test

The good news is that the KMO statistic is well above an unacceptable value and we can reject
the hypothesis that the correlation matrix is an identity matrix.

Figure 7.13 Communalities

Communalities are reported for both the initial and final (Extraction) factor solutions. Initially,
the principal component method derives components until all possible variance is accounted for,
in this case all 10 standard units of variance. Typically this means you end up with as many
factors as variables, which defeats the purpose of the analysis but provides a useful starting point.


Because the initial solution is complete (as many components as variables) using principal
components, communalities for the initial solution are equal to 1, that is, all of the variance for
each variable is accounted for by the solution. In the extracted solution only factors with
eigenvalues greater than 1 are retained (the default criterion) which means the solution will
probably not account for all of the variability in the variables. As a result, the communalities are
less than 1. Communalities in the extracted solution should be examined for low (near zero)
values. A low communality indicates a variable that shows little variation in common with the
others, and may lead to removing the variable from analysis.

Figure 7.14 Total Variance Explained

The history of the derived components is shown in the Total Variance Explained table. Note that
the first component accounts for the most variance (about 50%), the second for the second
greatest amount (21%), and so on (see Initial and Extraction summaries). Two components are
extracted because only two have eigenvalues greater than 1. Together they account for
approximately 71% of the variance in the decathlon scores. Also note that while the rotated
(Rotation section) and unrotated (Extraction section) solutions each account for the same total
amount of variance, the amount of variance attributed to each component differs between the
solutions. In the rotated solution each component accounts for approximately the same amount of
variance, but in the unrotated solution the first component accounts for far more variance than the
second.

Figure 7.15 Scree Plot


The scree plot aids in deciding how many factors/components to select. It plots the eigenvalues
on the vertical axis and the factor numbers on the horizontal axis. What we look for is an elbow,
or what would amount to the “scree” at the bottom of a mountain, which would suggest a
transition from large eigenvalues to small ones. In our graph there is such a bend at the third
factor, which indicates a two-factor solution. This is consistent with the heuristic of extracting
components with eigenvalues greater than 1.

Figure 7.16 Component Matrix

Figure 7.17 Rotated Component Matrix

The component matrix lists the factor loadings for each variable in the unrotated solution. In an
orthogonal solution these factor loadings are both the correlations and the regression weights
between factors (here components) and variables. Notice that almost every variable loads onto the
first component in the component matrix. While it is helpful that factor loadings of less than .3
are not displayed, with the first component accounting for so much of the variability in the
decathlon scores it is hard to see any separation between variables.


The rotated component matrix indicates a clearer separation. Recall that the varimax rotation tries
to simplify factors. Here all the track events load positively onto the first component. Having
these variables factor together makes sense, but less obvious is the fact that this component also
has negative loadings for all the jumping events (long jump, high jump, and pole vault). If we
take this component as an index of running ability (perhaps speed), at first glance you might think
the jumping loadings indicate that running ability hinders performance in jumping events in
which speed might be expected to play a role. However, this is reconciled by recalling that low
scores (times) in running events and high scores (distances) in field events are good. Thus the
negative loadings are explained by the fact that a good score is high in field (large distance) and
low (fast time) in track. The high positive loadings for the second component all correspond to
field events. The second component has small negative loadings for the 100 meters and the
hurdles, and a positive loading on the 1500 meter run. This factor may represent the strength of
the decathlete.

Additionally, there are two events—the pole vault and high jump—that load about the same on
both components. What might this mean about the skills required for these events?

The Component Plot is a graphical depiction of the loadings from the Rotated Component Matrix.
Although it adds no new information, it can be helpful in interpreting the components. However,
caution is necessary because variables that appear close on the plot are not necessarily on the
same component. Thus the long jump is closer to the three events on component 2, but it has a
large negative loading on component 1 and a small positive loading on component 2. As a result, it
belongs on component 1.

Figure 7.18 Component Plot in Rotated Space

Finally, as part of a rotated solution the Component Transformation matrix is provided (not
shown). The values in this table are technical in nature and describe how the factor rotation is
accomplished, but have little interpretive value.


Principal Axis Factoring with an Oblique Rotation


Now we will try to validate our first factor solution using a different method (here factor analysis)
with an oblique rotation. In the process, we request that factor scores be saved as variables.

Click the Dialog Recall tool button , then click Factor Analysis
Click the Extraction button (not shown)
Click the drop-down arrow next to Method and select Principal axis factoring
Click Continue
Click Rotation (not shown)
Select Direct Oblimin, leave delta at the default value (0) requesting the most oblique
factors possible
Type 50 in the Maximum Iterations for Convergence text box
Click Continue
Click Scores button in Factor dialog box
Click Save as variables check box

We set the number of iterations higher because an earlier run showed that the rotation didn’t
converge at the default of 25 iterations. You must always check to see whether the extraction and
rotation (if requested) have converged for a factor solution. If not, you set the number of
iterations higher, as we have done here. There is no penalty for doing so.

Figure 7.19 Factor Analysis: Scores Dialog

By checking Save as variables, SPSS will create a new variable for every factor extracted. They
are derived from factor loadings. This process involves solving for the factors as a function of the
variables (the loadings themselves are from equations expressing the variables as a function of the
factors). There are several methods for doing this and they differ in fairly technical ways; the
default method (Regression) is quite adequate.

Click Continue, then click OK
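Compared with the previous run, the pasted syntax changes only in the extraction method, the rotation, the iteration limit, and the request to save regression-method factor scores (again a sketch with placeholder variable names):

    FACTOR
      /VARIABLES run100 longjump shotput highjump run400 hurdles discus polevault javelin run1500
      /MISSING LISTWISE
      /PRINT UNIVARIATE INITIAL KMO EXTRACTION ROTATION
      /FORMAT SORT BLANK(.3)
      /PLOT EIGEN ROTATION
      /CRITERIA MINEIGEN(1) ITERATE(50)
      /EXTRACTION PAF
      /ROTATION OBLIMIN
      /SAVE REG(ALL).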


Click the Goto Data tool
Scroll to variable fac1_1

We now have two variables containing scores for decathletes on the two factors derived from the
principal axis factoring. They are in z score (standardized) form having means of zero and
standard deviations of 1. The first decathlete’s score on the second factor (FAC2_1) is more than
1 (1.083), indicating that he is more than a standard deviation above the mean on this factor.


Figure 7.20 Component Variables in Data Editor

Switch to the Viewer window

Now let's turn to the output from the principal axis factoring. Notice that the communalities for
the initial solution (not shown) are less than 1 because this method tries to account for
correlations rather than variance. This method, like principal components, also extracted two factors.
Note that the total variance explained in the initial solution is the same as in the principal
components solution, since principal components is used as a starting point for the factor methods,
but the variance explained in the oblique rotation is smaller. This occurs because factor methods,
unlike principal components, do not maximize the variance accounted for. The scree plot looks
the same, and the factor-loading matrix is only slightly different from the component matrix
despite the differences in method.

Figure 7.21 Pattern Matrix


Figure 7.22 Structure Matrix

Rather than getting one matrix for the rotated factor loadings, as we did with principal
components, we see that principal axis factoring produces a pattern matrix and a structure matrix.
Because the factors are correlated the rotated factor loadings no longer represent both the
correlations and the regression weights (Beta coefficients) between factors and variables. The
pattern matrix reports the standardized linear weights and the structure matrix displays the
correlations between the factors and the variables. For interpretation the pattern matrix is
preferred.

Figure 7.23 Factor Correlation Matrix

An additional output table is the factor correlation matrix. The correlation coefficient between the
two factors is .263, a modest positive correlation.


Figure 7.24 Factor Plot in Rotated Factor Space

Finally, the factor plot in rotated factor space is consistent with the principal component analysis.
There seems to be as clear a separation between the track and field events as in the principal
component solution.

Additional Considerations
Two different data reduction methods (principal components and principal axis factor analysis)
indicate that decathlon scores separate into two factors (or components), which correspond to
track and field abilities. Reading into the results, we could say that these factors represent the
athletic abilities of speed and strength. However, at this point the results only provide a premise
on which to ask additional questions. We should validate this analysis on a new set of decathlon
scores, perhaps with a wider sampling scheme than just Olympic-class male athletes. The results
indicate that a large proportion of the variance among athletes in the events seems to be captured
by two components or factors. Additional questions might include: 1) If several events were to be
dropped, which are least central to the factors? 2) If a shortened form of the decathlon (only two
events) were to be held, which two events would best represent the factors? 3) Would the winner
still be the winner with only those two events?


Chapter 8
Loglinear Models

Topics
• What are Loglinear Models?
• Relations Among Loglinear, Logit Models, and Logistic Regression
• What to Look for in Loglinear and Logit Analysis
• Assumptions
• Procedures in SPSS that Run Loglinear or Logit Analysis
• Model Selection Example: Location Preference
• Appendix: Logit Analysis with Specific Model (Genlog)

Introduction
Many of you are familiar with data analysis of two-way crosstabulation tables involving counts,
and the associated chi-square test of independence. Such analyses of categorical (nominal)
variables can be generalized to investigate relations in higher-way tables (3-way, 4-way, etc.). For
example, suppose you are interested in voting choice (someone votes for candidate A, candidate
B, or does not vote) as it relates to her gender, religion, age, and income (in categories). Any two
variables can be examined and tested in a crosstabulation table, but in order to investigate more
complicated relationships a more general analysis is necessary. Loglinear models provide a
framework to examine, test and estimate relationships among categorical variables. If one of the
variables is considered a dependent measure to be predicted from the others, then a variant called
logit analysis can be applied. In this chapter we will introduce and provide a framework for
loglinear and logit models. However, we should make clear that our presentation is only a brief
introduction and entire university courses are devoted to this topic. For readable introductions to
this area, see Agresti (1996) or Fienberg (1980); for a more technical and complete discussion see
Agresti (2002) or Bishop, Fienberg and Holland (1975).

What Are Loglinear Models?


We begin our discussion of loglinear models with a specific and straightforward example. The
crosstabulation below displays geographic preference and region of origin. The data are taken
from the Stouffer et al. (1949) study of American soldiers.


Figure 8.1 Crosstabulation Table: Location Preference by Geographic Origin

We have a very simple table displaying the relationship between geographic origin and location
where the soldier wished to be stationed. Not surprisingly, respondents wanted to be stationed in
the region in which they grew up. If we request a chi-square test of independence of the rows and
columns in the population, we obtain the results below.

Figure 8.2 Chi-square Test of Independence: Location Preference by Geographic Origin

The chi-square test is significant indicating that geographic preference and region of origin are
related in the population of soldiers. This chi-square is calculated by measuring the discrepancy
between the observed cell counts and the expected cell counts assuming independence of the row
and column variables. If nij represents the count in cell (i,j), that is, row i and column j, and mij
represents the expected cell counts under a specified model, then we have:

ln(mij) = u + ai + bj + (ab)ij for i, j = 1,2 where ln represents the natural log.

The equation states that the natural log of the expected cell counts is equal to the effects of u (the
overall intercept), ai (main effect of the row variable, here geographical origin), bj (main effect of
the column variable, here location preference), and (ab)ij (the interaction term). The chi-square
statistic obtains expected values for the cell counts (mij) assuming (ab)ij is 0, and compares these
estimates to actual cell counts (nij). Such models are linear models in the natural log scale, as
ln(mij) is linearly related to the parameters ai, bj, and (ab)ij; for this reason, they are called loglinear
models. Since the equation above can be easily generalized to allow for additional variables (just
add subscripts and parameters), loglinear models provide a general way to test for relationships
and build prediction equations when the basic measures are counts of observations within
categories.

Although it is tedious, we can obtain a loglinear expression for the expected counts in every cell
of the table.


ln(m11) = u + a1 + b1 + (ab)11 (Cell 11)

ln(m12) = u + a1 + b2 + (ab)12 (Cell 12)

ln(m21) = u + a2 + b1 + (ab)21 (Cell 21)

ln(m22) = u + a2 + b2 + (ab)22 (Cell 22)

The problem here is that we have four data points (cells) and nine parameters to estimate (u, a1,
a2, b1, b2, (ab)11, (ab)12, (ab)21, (ab)22). In practice, placing some constraints on the parameters
solves this. A common practice is to include a constant (u) term and force the sum of parameters
for each main effect and interaction to be 0 (here a1 = –a2). An alternative is to use what is called
the general linear model approach, which uses an indicator (0,1) coding. For example, since effect
(a) has two levels, the first, a1, will be estimated, and the second, a2, will be fixed to 0 (this
technique is called aliasing). This latter approach is available by choosing Loglinear…General, or
Loglinear…Logit. Statistical tests for main effects and interactions, predictions and conclusions
would be identical under the two approaches (parameterizations). However, individual
coefficients are interpreted differently. For a more detailed look at these issues see the references
cited earlier. Also, for a complete discussion of the general linear model, see McCullagh and
Nelder (1989).

Relations among Loglinear, Logit Models, and Logistic Regression

The simple loglinear model appearing above fits a linear model to the natural log of cell counts.
There is no declaration, nor expression of independent and dependent variables; the equation
states that the expected cell counts are a function of the main and interaction effects of the two
variables in the table. However, if there is interest in viewing one of the variables as a function of
one or more of the others, a variant called a logit model is available. Recall from the discussion of
odds in the logistic regression chapter that the odds of something happening are equal to p/(1-p),
where p is the probability of the event occurring. Looking at the first row of Figure 8.1 we see
that for people originating from the North, 3092 preferred the North and 958 chose the South
(total N of 4050). The odds can be expressed as a ratio of probabilities or as a ratio of cell counts.
The odds of a Northerner choosing the North are:

Odds (Choosing North by a Northerner) = p/(1-p) = (3092/4050) / (958/4050) = n11/n12 = 3092/958 = 3.23.

A Northerner is about three times as likely to choose to be stationed in the North as in the South.

We see that odds can be expressed as a ratio of counts. Recall that loglinear models provide
equations relating expected counts to parameters. Tying these features together, if we express the
dependent variable in an odds form (actually the natural log of the odds), then we have an
equation relating the log odds of the dependent measure as a linear function of parameters
relating to the independent variables. In this way a variation of loglinear analysis can
accommodate a predictive model with a dependent measure relating to independent variables; this
variation is called logit analysis.


Speaking generally, the loglinear results of a data set can be converted to logit results, and vice
versa. In loglinear analysis there is interest in all relations among variables. Logit analysis with its
declared dependent variable focuses on the relationship between the independent variables and
the dependent measure. So in logit analysis there is minimal interest in how the independent
variables relate among themselves. If a loglinear analysis is run in which one variable is
considered a dependent measure, then relations not involving that measure are ignored. If a logit
analysis is directly applied, then relationships among the predictor variables are fit (in order to fit
the table of counts), but are not tested (the Genlog procedure reports them as “constants”). In
summary, loglinear analysis models relationships in a multi-way table of counts. Logit analysis is
a special case of loglinear modeling in which a dependent measure is declared, the relations
between the dependent measure and the predictors are the focus, and results are usually expressed
as odds.

Earlier in this course we reviewed logistic regression models. Recall in logistic regression a
dependent measure is declared, and each observation may have a unique set of values for its
predictor variables since they are interval scale. For this reason, logistic regression models are fit
to individual observations. For logit models (in logit models a dependent measure is declared) we
generally assume all variables are categorical (although there are ordinal and covariate type
models) so many individuals will share the same set of values. For example, in Figure 8.1 there
are hundreds of individuals who share the same data values (they fall into the same cell). As a
result, loglinear and logit models are fit to cell counts while logistic regression models are fit to
individual data. Logit and logistic regression models share the same goal: prediction of a
dependent variable from independent variables, but differ in the assumptions made about the
measurement scale of the predictor variables. What if we took the data from this chapter and ran
it under logistic regression with the predictor variables declared as categorical? Except for
estimation method differences, we would obtain the same results.
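As a rough sketch (not run in this chapter), that comparison could be set up along the following lines, with the predictors declared categorical through indicator contrasts and case weighting still in effect. The exact subcommands pasted by the Logistic Regression dialog may differ by version.

* Sketch only: the same prediction problem run as a binary logistic regression
* (main effects only; case weighting by wt remains in effect).
LOGISTIC REGRESSION VARIABLES=preferen
  /METHOD=ENTER current origin race
  /CONTRAST (current)=Indicator
  /CONTRAST (origin)=Indicator
  /CONTRAST (race)=Indicator.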

What to Look for in Loglinear and Logit Analysis


When running loglinear or logit analysis we are interested in assessing fit and determining which
relationships are statistically significant, that is, which variables are related to which others in the
population. For the significant effects we want to describe the nature of the relationships, obtain
parameter estimates to quantify them, and develop a prediction equation. Also, although we won’t
examine it in this chapter, analysis of residuals can be done to identify the cells that are worst fit
by the model.

Assumptions
Loglinear models assume that the counts follow a multinomial or Poisson distribution; for logit
models a multinomial distribution is assumed. The choice between multinomial and Poisson hinges on
whether the sampling is constrained (whether subgroups of fixed size are sampled). Agresti (1996)
discusses this question. The model is assumed to be
linear in the scale of natural logs of the cell counts. There is also an assumption of conditional
independence, meaning that the observed responses are independent of each other after adjusting
for the effects in the model. For example, if I could appear multiple times in the data, my
observations are likely to correlate more highly than if individuals were independently sampled
each time. The tests performed in loglinear and logit models are asymptotic tests, which means
the chi-square approximation improves as sample size increases relative to the number of cells. A
conservative rule of thumb sometimes mentioned is to have at least 25 observations in each cell,
but many studies fall short of this. As exact statistical test algorithms become more efficient, this
will be less of an issue.

Procedures in SPSS that Run Loglinear or Logit Analysis

SPSS contains three procedures that perform loglinear or logit analysis. All are contained in the
SPSS Advanced Models option. Of related interest, there are procedures that run binary and
multinomial logistic regression (covered in Chapters 3 and 4) and ordinal regression (see brief
discussion below). Two of the three procedures are available using the menu system.

Model Selection
The Model Selection procedure provides very useful information about the level of complexity of
relations among variables (any two-way, three-way, four-way interactions?), performs
significance tests on effects in the model, and produces coefficients for loglinear or logit analysis.
The limitation is that Model Selection can only fit hierarchical models. Specifically this means
that if a higher order interaction is included in the model, then all derivative lower level
interactions and main effects must be included as well. Thus Model Selection cannot fit a model
with an A by B by C interaction, but no A by B interaction. This restriction permits a relatively
fast estimation method (iterative proportional fitting) to be used, but it does mean that some
models cannot be fit using this procedure. Minimally, it provides much helpful information to
guide you in running the more general loglinear procedures.

General Model/Logit

The General and Logit choices on the Loglinear menu both run the Genlog procedure, which can fit
loglinear and logit models and is not restricted to hierarchical models. It produces goodness-of-fit
tests and parameter estimates for the specified model. Genlog employs a general linear model
parameterization, so parameter estimates will not match those of the Hilog (Model Selection) or
Loglinear procedures, although these differences can be reconciled. User-specified
linear contrasts can be performed and covariates (in the limited sense of adjusting based on the
cell mean values of the covariates) can be included in the analysis. It can better adjust the analysis
when the data table contains structural zeros (cells that should be excluded from the analysis due
to logical impossibility or irrelevance to the hypothesis under investigation) and is recommended
for this reason.

Ordinal Regression
Ordinal regression is appropriate when the dependent measure is ordinal. It fits a model to the
cumulative probabilities of the categories in the ordinal dependent variable using categorical and
interval predictors. A general procedure (a form of the general linear model, see McCullagh and
Nelder (1989)), it also supports loglinear-type models through a logit link function. For more
information, see the SPSS Advanced Models manual and the Ordinal Regression case study in the
SPSS Help system. Ordinal Regression is a choice under the Regression menu.


Model Selection Example: Location Preference


To illustrate loglinear analysis we will use data collected by Stouffer et al. (1949) in a study
entitled “The American Soldier,” a survey of U.S. soldiers done in the wake of the Second World
War. We will examine relations among four variables:

preferen   In which region (North or South) the soldier prefers to be stationed
race       Race of soldier (White or Black)
origin     Region of origin (North or South)
current    Region (North or South) the soldier is currently stationed in

Each variable is dichotomous, which makes the interpretation of the coefficients a bit simpler, but
the analysis allows for more than two levels for each of the variables. If we consider Preference to
be a dependent variable to be predicted from the other three then we would perform a logit
analysis.

First we will run the Model Selection procedure to examine any relationships, but will focus on
those involving Preference since that was of special interest. We are specifically interested in
knowing which variables in what combination significantly relate to Preference, and then wish to
interpret and quantify these relationships.

Setting Up the Model Selection Analysis


The first step is to read the data.

Click File…Open…Data
Move to the c:\Train\Advstat directory (if necessary)
Double-click on Stouffer.sav
Click on View…Value Labels to display value labels (if necessary)


Figure 8.3 Aggregated Stouffer Data

The data file is in an aggregated form; each row of data corresponds to a cell in the analysis. For
example, the last row represents 870 whites born in the south, stationed in the south, and who
preferred to be stationed in the south. The reason for this format is that the information was only
available as a published table. A message in the right side of the SPSS status bar (just visible)
indicates that weighting is on; each row will be weighted by the value of the wt variable when
analyses are run. This was accomplished by turning weighting on (click Data…Weight Cases)
before the SPSS data file was saved. You will typically work with individual data and the setup is
identical to what we do here.
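For reference, the weighting and the two-way table shown in Figures 8.1 and 8.2 could be reproduced with syntax along these lines (a sketch; Stouffer.sav was saved with the weighting already turned on):

* Weighting is already on in Stouffer.sav; shown here for completeness.
WEIGHT BY wt.
CROSSTABS
  /TABLES=origin BY preferen
  /CELLS=COUNT
  /STATISTICS=CHISQ.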

Click Analyze…Loglinear…Model Selection


Move current, origin, preferen and race into the Factor(s) list box

The question marks appearing beside each variable indicate that the minimum and maximum
integer values determining the range must be supplied for each variable. Notice there is no
separate declaration of dependent and independent variables; relations among all variables will be
examined. The Cell Weights list box does not control the case weighting we discussed, but
instead allows you to differentially weight cells in the analysis. This is usually done in order to
instruct the program to ignore certain cells (they are given 0 weight), which is done if there are
logically impossible cell combinations (for example, people in their 30s who first married in their
50s). By default, Model Selection will run a backward elimination testing procedure. This begins
by fitting the full model (preferen, origin, race, and current, with all main effects and
interactions), then eliminating the most complex interaction if it is not significant. Then the next
most complex effect is tested and dropped if not significant. The process continues until only
significant effects (and their derivative effects; if there is an A by B interaction, the A and B main
effects are kept) remain in the model. This would tell us how complicated a model is required to
fit the data. We should be able to deduce this from the partial association table.


Figure 8.4 Model Selection Loglinear Analysis Dialog

Select all variables in the Factor(s) list box


Click the Define Range button
Type 1 in the Minimum text box
Type 2 in the Maximum text box

Figure 8.5 Supplying the Range of Values to Include

With this information SPSS can calculate the memory requirements for the analysis without
reading the data file. If we didn't know the minimum and maximum values, we could have checked the
value labels earlier by right-clicking on a variable name in a variable list box and then clicking
Variable Information on the Context menu.

Click Continue
Click the Model button


Figure 8.6 Loglinear Analysis: Model Dialog

By default, Model Selection will fit a saturated (or complete) model, that is, all main effects and
all possible interactions. You can supply your own customized model to be fit (perhaps dropping
high-order interactions). However, be aware that Model Selection is restricted to hierarchical
models, so if a high-order effect is specified (for example, a three-way interaction), its lower-
order effects (the relevant two-ways and main effects) must be included as well.

Click Cancel
Click the Options button
Click Parameter estimates and Association table in Display for Saturated Model area

Figure 8.7 Loglinear Analysis: Options Dialog

The Model Selection procedure ordinarily displays a table of frequency counts and the residuals
from the final model. The partial association table provides significance tests for almost every
effect in the model, which is quite helpful. Similarly the parameter estimates quantify the
relationships and can be used in prediction equations. Even without requesting these options,
Model Selection will provide some information about how simple or complex a model is required
to fit the table.

Notice the Model Criteria area contains a Delta parameter set to .5. This is a constant added to
each cell in the table in order to avoid zero cell counts. Recall that the loglinear model is based on
the natural logs of the cell counts, and the natural log of zero is undefined (minus infinity). To
avoid this problem a small constant is added to each cell. If you have no empty cells, you can set
Delta to zero, but when there are no zero cells the Delta of .5 makes little material difference.


Click Continue

Figure 8.8 Completed Model Selection Loglinear Analysis Dialog

Setup is complete. The only feature we might have added would be to request that a model be
built not by backward elimination (the default), but in one step.

Click OK
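The specification above pastes syntax roughly like the following (a sketch; criteria other than delta are left at their defaults and omitted here):

HILOGLINEAR current(1 2) origin(1 2) preferen(1 2) race(1 2)
  /METHOD=BACKWARD
  /CRITERIA=DELTA(.5)
  /PRINT=ASSOCIATION ESTIM
  /DESIGN.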

Hint: If All Results Are Not Visible


If all output does not display in the Viewer window (you’ll notice a red arrow at the bottom of the
visible output from the procedure), simply right click on the output in the Contents pane, which
invokes the Context menu, then click SPSS Rtf Document Object…Open. This will open a text
editor window permitting easy scrolling through all of Model Selection’s results. We have
slightly edited the output for ease of display.

Figure 8.9 Data Information


First we see information concerning the sample size, and numbers of missing and out-of-range
observations. Each factor appears with its number of levels. Note the sample size is quite large.
This is typical for loglinear analysis.

In the Convergence Information table (not shown), the generating class is based on the four-way
interaction of variables (current*origin*preferen*race). The term “generating class” is a
shorthand way of describing a hierarchical model. Since in a hierarchical model the presence of a
complex effect implies the inclusion of the relevant simpler effects (A*B implies that A and B
must be included), then you need only state the most complex effects in order to describe the
model. We requested (the default) a saturated model, so the four-way interaction is the generating
class.

SPSS solves for the parameters in a loglinear model through iteration, so you should look in the
Convergence Information table to make sure that the process converged.

Figure 8.10 Observed and Expected Frequency Table

In the Cell Counts and Residuals table, each row represents a cell in the analysis. For example,
the first row contains the count (387.5) of Blacks from the North, currently stationed in the North,
who prefer the North. As mentioned earlier, .5 is added to each cell to prevent problems with
empty cells. The expected counts are based on the requested model. Since a complete (saturated)
model was fit, the expected values perfectly match the observed counts and all residuals are zero.
This summary becomes more interesting if an incomplete model is requested, or when viewed
after backward elimination is performed.

Significance Tests
Goodness-of-fit tests comparing the model to the data are performed; these tests will be
nonsignificant when the predictions from the model closely match the observed data (because the
predicted cell frequencies will be close to the actual cell frequencies). Since a saturated model
was used the fit is perfect; the chi-squares are 0 with a probability of 1. This result is trivial, but
would be of interest if anything other than a complete model were applied.


Figure 8.11 Chi-Square Goodness-of-Fit Test

The K-Way and Higher Order Effects table contains two subtables (K-way and Higher Order
Effects and K-way Effects) that are quite valuable in determining how complex a model is needed
to fit the data. Pooled chi-square tests (both Pearson and Likelihood ratio) are performed to
inform you whether effects are significant.

The K-way and Higher Order Effects subtable reports tests for whether a level of interactions and
higher levels are zero. The K-way Effects subtable simply reports whether a specific level of
interactions is zero.

Of the two subtables, the K-way Effects provides a more direct summary, although the same
conclusions are derived from each. For example, in the K-Way Effects subtable, the row where k
is 3 tests whether any three-way interactions are significant (they are because the Sig. value is
low, less than .00005). The corresponding row in the K-way and Higher Order Effects subtable
tests whether any three-way or four-way interactions are significant. From either table we would
conclude that there is no significant four-way interaction, but there is at least one three-way
interaction. However, the K-Way Effects subtable provides this information in a direct fashion. In
addition to the three-way interactions, we see there are strong two-way interactions (compare the
chi-square values for rows 2 and 3), and significant main effects. A main effect here simply
means there is not an even split across the two categories for a variable.

Figure 8.12 K-Way Effects and Higher-Order Effects

We now know the order (degree of complexity) of the significant effects, but not which specific
effects they are; the partial association table tells us just that.


Figure 8.13 Partial Association Tests

Partial association tests are performed for all but the highest-order interaction in the model (the
result we already know from the previous table). They are chi-square tests of each main effect
and interaction after adjusting for other effects of the same or lesser complexity in the model. For
example, the test of the preference by origin interaction (two-way interaction) would compare
differences between the goodness of fit chi-square based on a model of all main effects and two-
way interactions excluding preference by origin, to the chi-square with it included. If we focus
our attention on the preference variable, we see that race, origin and current location all
significantly relate to it (origin most strongly). In addition, preference is involved in a three-way
interaction with origin and current location.

Thus partial association tests show which specific relations are significant.

Coefficient Interpretation
We know from the partial association test results which effects are statistically significant. The
next step is to quantify the relationships. The parameter estimates provide this information, but to
use them effectively either a hand calculator or your computer’s calculator is necessary. It is
important to be aware of what the parameters represent when interpreting the coefficients. The
SPSS Advanced Models manual provides more detail, but we will say here that if there are two
levels to a variable, the default parameters relating to that variable in the Model Selection
procedure reference the first category of the variable. This fact is necessary to interpret the
parameter estimates.

Rather than examine all significant findings, we will focus on those involving preference since it
is of most interest to us. Let’s begin with the main-effect parameters at the bottom of the output.


Figure 8.14 Parameter Estimates

For each effect the parameter estimate, standard error, z statistic and 95% confidence band are
displayed. The z statistic or confidence band can be used to determine statistical significance
(absolute value of z greater than 1.96 is significant at the .05 level). We see the z statistic for
preference (6.96) is above 1.96 and thus significant, indicating that other things being equal, there
is not an even 50/50 split between choosing the North versus South. Since the first category
(North) is the reference category in this parameterization, the positive sign of the estimated
coefficient (.125) indicates that the North is preferred. To attach a number to interpret this for
preference, we must double the coefficient and raise it to the e (antilog function) power. The
doubling is due to scaling involved in the parameterization and raising the result to the e power
converts from the natural log scale back to the original units. Thus the North is preferred by a
factor of e(.125 * 2) or 1.28; another way to express this is to say that other things being equal the
North is a little more likely to be chosen, or the odds of choosing the North are 1.28 to 1. The
other main-effect coefficients can be similarly interpreted.

Two-Way Interactions
Figure 8.14 also displays all two-way interactions involving preference. Looking at the preference
by origin interaction (.616) we see it is highly significant. Recall that the first category for each
variable is referenced by the parameter, and North is the first category value for both preference
and origin. Thus Northerners are e^(2 * .616), or about 3.45 times as likely to prefer the North
compared to the base rate mentioned earlier (odds of 1.28 to 1). This is slightly higher than the
odds we calculated for the two-way table of origin and preference (see Figure 8.1); the odds
changed because race and current location are taken into account. Similarly the estimate for the
interaction between current location and preference is .370, so other things being equal, those
stationed in the North are e^(2 * .370), or about twice (2.09) as likely to prefer the North. Finally, the
race by preference interaction estimated coefficient is .183, and the first category is Black. Thus,
other things being equal, Blacks prefer the North by a factor of e^(2 * .183), about 1.44. The other
two-way interactions can be interpreted in the same way.
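If you prefer to let SPSS do the arithmetic, a quick scratch check of these factors could be computed as below. The new variable names are hypothetical and can be deleted afterward; a hand calculator works just as well.

* Antilogs of the doubled Model Selection coefficients quoted above
* (values appear as new columns in the Data Editor).
COMPUTE odds_north = EXP(2 * 0.125).
COMPUTE or_origin = EXP(2 * 0.616).
COMPUTE or_current = EXP(2 * 0.370).
COMPUTE or_race = EXP(2 * 0.183).
EXECUTE.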

At this point we know those from the North prefer the North, those stationed in the North prefer
the North, and Blacks prefer the North. Are any of these interpretations qualified by three-way
interactions?


Three-Way Interactions
Not all the three-way interactions shown in Figure 8.14 are significant (examine the z statistics)
and we are interested in relations involving preference. The only significant three-way interaction
involving preference is the preference by origin by current interaction. We know that both origin
and current location are related to preference. However, the three-way interaction implies that the
combined effect of originating in the North and being stationed in the North, as it relates to
preference, is not simply the combination of the two effects. The coefficient (–.08) is negative,
which means that simply combining the estimated effects for being from the North and being
stationed in the North overstates the overall effect. The combined effects of the two-way
interactions, a Northerner stationed in the North preferring the North, evaluated above (3.45 *
2.09) must be reduced by e^(2 * -.078), or .86. Thus the odds shrink from 7.2/1 down to 6.2/1.

The estimated coefficients quantify relations among the categorical variables and can be used in
prediction equations.

Backward Elimination Results


When Backward Elimination is selected, nonsignificant effects are eliminated from the model
(here a saturated model), beginning with the most complex effect (here the four-way interaction)
and moving to less complex effects. The goal is to identify a simple model that fits the data. Note
that if a significant interaction is present (say the three-way effect A*B*C) then the backward
elimination algorithm will not examine the lower-level effects (A*B, B*C, A*C) within the
original generating class, and thus they will be retained even if non-significant. This is a result of
the hierarchical constraint applied to the model by the procedure. Rather than examine each step
in the backward elimination process, we will summarize the early steps and view the final result.


Figure 8.15 Backward Elimination Results

In the first step (Step 0) in Figure 8.15, the most complex effect (four-way interaction) was tested
for statistical significance, found to be non-significant, and was dropped from the model. In the
second step (Step 1), the preference by race by current location interaction was dropped, and in
the third step, the preference by race by origin interaction was dropped.

In the fourth step (Step 3), there are three candidates for deletion, but the probability (Sig.)
column indicates that each is statistically significant, so no further effects can be removed using a
likelihood ratio chi-square change test.

In the last step, SPSS lists the final model, which fits the data well, since the significance is well
above .05 (.694). The model contains a preference by origin by current interaction, which we
found to be significant earlier. It also includes the race by origin by current interaction, which
isn’t of interest to us since it does not involve preference but needs to be included to fit the data in
the table. Finally, the preference by race interaction, which we also reviewed, is retained. We
should note that the two-way interactions of preference by origin and preference by current
location would also be included in the model since they are within the generating class of
preference by origin by current, as would the main effects. In short, backward elimination
provides an automated means of eliminating non-significant effects from the model, although it
doesn’t provide as much individual effect information as the parameter estimates and their
significance tests.


Following this, SPSS provides a Cell Counts and Residuals table and a Goodness-of-Fit Tests table
for the final model, which we will not review further, except to note that the cell residuals
can be used to investigate where the model fits poorly.

Figure 8.16 Cell Counts and Residuals

Figure 8.17 Goodness of Fit

Appendix: Logit Analysis with Specific Model (Genlog)


Rather than fitting a complete (saturated model) using the logit procedure, we will construct a
custom model based on the results from the Model Selection procedure. We will test whether our
model fits the data and illustrate how to set up a custom model. We will attempt to predict
preference with origin, race, current location, and the origin by current location interaction.
Recall these effects are only considered as they relate to the dependent variable.

Setting up the Analysis


To explicitly request a logit model in SPSS:

Click Analyze…Loglinear…Logit
Move preferen into the Dependent list box
Move current, origin and race into the Factor(s) list box


Figure 8.18 Logit Dialog Box

We have identified the dependent and predictor variables. You can incorporate an interval-scale
covariate into the analysis, to equate the cells on some other measure, using the Cell Covariates list
box. In this case each cell's mean value for the covariate(s) would be calculated and adjustments
would be based on these values. This differs from analysis of covariance in an ANOVA context
in that there the adjustment is based on individual differences on the covariate; here we are fitting
cells in a table and covariance adjustment is applied at this level. The Cell Structure list box
permits you to supply cell weights, usually to eliminate logically impossible or irrelevant cells
from the analysis (structural zeros, discussed in the Model Selection section). Finally, you can
supply contrasts applied to the cells in the table that perform specific tests of hypotheses. These
contrasts are represented as variables, and are created using the Compute dialog box.

The Save button allows you to save predicted values and various types of residuals; these latter
summaries are useful in identifying cells poorly fit by the model.

Click the Model button


Click the Custom option button
Move (separately) current, origin and race into the Terms in Model list box
Click current and Shift-click on origin (to select both)
Move the current * origin interaction into the Terms in Model list box


Figure 8.19 Building a Custom Model

Note that the preference variable does not appear in the Factor(s) and Covariates list box. This
may seem surprising, but recall that we requested a logit model with preference as the dependent
measure. As a result the Logit procedure automatically includes preference in the model. In
addition, all the effects appearing in the Terms in Model list box are related to preference. In
other words, because we are running a logit model, what seems to be the main effect of current
location in the Factor(s) list box is actually fit as the current by preference interaction. We have
thus included all effects significantly related to preference from our earlier analysis. To properly
fit the table, the logit procedure also includes parameters involving the predictor variables and
relationships among them. However, since these effects are not relevant to relations involving the
dependent measure, they are merely listed as constants in the parameter estimate section, and
neither standard errors nor confidence bands will be calculated for them.

Click Continue
Click the Options button
Click the Estimates check box

Figure 8.20 Logit Loglinear Analysis: Options Dialog


As with the Model Selection procedure, we request parameter estimates. The design matrix
provides information about the nature of the parameters being fit by the logit procedure. Notice
you can control the confidence interval that will be applied to parameter estimates. Various
residual plots can be requested to evaluate the fit and identify poorly-fit cells. Note that the Logit
procedure adds .5 (Delta) to each cell by default, to avoid problems with cells having zero counts.

Click Continue, then click OK
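For reference, the Logit dialog specification above corresponds roughly to Genlog syntax like the following. This is a sketch only; the exact pasted subcommands and the design notation may differ by version.

GENLOG preferen BY current origin race
  /MODEL=MULTINOMIAL
  /PRINT=FREQ RESID ESTIM
  /CRITERIA=CIN(95) DELTA(.5)
  /DESIGN=preferen preferen*current preferen*origin preferen*race
          preferen*current*origin.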

Results
We will not discuss all the Genlog output, which is quite extensive. After the convergence table,
there is a test of our overall model, with all terms we have specified, in the Goodness-of-Fit Tests
table. There are two tests, using either a standard chi-square or a likelihood ratio chi-square. Both
tests usually yield identical results, as is the case here. Recall that if a logit (or loglinear) model
fits the data well, the significance will be well above .05 (because the expected cell counts will
closely match the actual cell counts). The significance values are both about .69, so our model
does fit.

Figure 8.21 Model Fit Tests

R-square measures in regression tap the degree of association between predictors and the
dependent measure. Attempts have been made to develop similar measures in the context of logit
models, and two (based on entropy and concentration) are presented in the Measures of
Association table. The difficulty with such measures in practice is that they don’t tap explained
variation directly and the intuition we have concerning what constitutes a substantial R-square
does not apply. Nonetheless, if you run several models these measures provide a sense of relative
improvement. For discussion of their use and interpretation, see Haberman (1982).


Figure 8.22 Measures of Association

The next table shows the observed and expected counts based on our customized model. We are
interested in seeing how well the predicted counts match the observed counts. Viewing the cell
residuals will directly assess this. Because the table is so wide, Figure 8.23 shows the first half,
with observed and expected counts, and Figure 8.24 contains the cell residual statistics.

We can see that, generally, our model fits the data quite well. The observed counts are quite close
to the expected counts.

Figure 8.23 Table of Observed and Expected Counts

In the second half of the table in Figure 8.24, we see the raw residual (observed count minus
expected count), the standardized residual (residual divided by its standard error), an adjusted
residual (the standardized residual divided by an estimated standard error), and a deviation
residual (it measures the cell's contribution (signed) to the chi-square goodness of fit, another way
to identify poorly fit cells). None of the standardized or adjusted residuals is large (close to 1.96),
so there are no cells that are especially poorly fit.

Figure 8.24 Cell Residual Summaries

Next we turn to the Parameter Estimates table (see Figure 8.25). Here, Genlog displays parameter
estimates, standard errors, z statistics and confidence bands for the parameters (non-aliased) in
the model.

Earlier we briefly discussed that Genlog takes a general linear model approach to defining
parameters. There is a parameter for each possible effect, and constraints are applied to avoid
redundancy and permit estimation. For example, examine the Preference effect (the preferen=1
and preferen=2 parameters). There are two levels of preference so there is one degree of freedom
available, and so just one parameter can be estimated. Genlog will estimate preferen=1 (which
represents the effect of being in the first category—North) and alias the preferen=2 parameter by
setting it to zero (note the footnote mentioning that the parameter is redundant). A similar
approach is taken for other parameters, including interactions.

Additionally, constant terms will be included to fit each combination of the independent
variables, which are not of interest, but must be included for model estimation.

The actual interpretation of each parameter is more complicated than in the Model Selection
procedure because of the difference in effect coding. As you can see by glancing through the
table, there are many aliased parameters. This is a necessary result of the approach taken.
Examining the table, you can see that every term we requested is incorporated in the model. The
first eight parameters simply account for relations among the independent variables and are not of
interest. In a loglinear model, these relations would be of interest and more information would
appear concerning them.


Figure 8.25 Parameter Estimates for Model

The parameters can be used to obtain predicted values. Note that, unlike those from the Model
Selection procedure, these parameter estimates do not require doubling before the antilog is taken.
In relatively simple analyses (two-way tables) the parameters represent simple functions
of the expected cell counts and can be interpreted in a direct way. In more complicated tables,
such as ours, they represent more complex combinations of expected cell counts and a simple
interpretation is more problematic. For those desiring to directly interpret the parameter estimates,
McCullagh and Nelder (1989) have a more complete and technical discussion of the parameters
used in the general linear model and Agresti (2002) has a detailed discussion of many loglinear
models.

In addition, several residual plots are produced. Since they are examined in the same way as they
would be when running analysis of variance, we will not discuss them in this context.


Chapter 9
Multivariate Analysis of Variance

Topics
• Why Perform MANOVA?
• Assumptions
• What to Look for in MANOVA
• An Example: Memory Influences
• Examining the Output
• Post Hoc Tests

Introduction
Multivariate analysis of variance (MANOVA) is a generalization of analysis of variance that
permits testing for mean differences on several dependent measures simultaneously. In this
chapter we will explore the rationale and assumptions of MANOVA, review the key summaries
to examine in the results, and then step through an analysis looking at group differences on
several memory measures.

Why Perform MANOVA?


Multivariate analysis of variance (MANOVA) tests for population group differences on several
dependent measures simultaneously. Instead of examining differences for a single outcome
measure (as analysis of variance does), MANOVA tests for differences on a set or vector of
means. The outcome measures (dependent variables) are typically related; for example, a set of
ratings of employee performance, multiple physiological measures of stress, several scales
assessing an attitude, a collection of fitness measures, multiple scales measuring a product's
appearance, several measures of the fiscal health of a company. MANOVA is typically performed
for one of two reasons: statistical power, and control of false positive results (also known as Type
I error).

First, MANOVA can provide greater statistical power, which is the ability to detect true
differences, in a multivariate analysis. The argument is that if you have several imperfect
measures of an outcome, for example, several physiological measures of stress, the joint analysis
will be more likely to show a true difference in stress than any individual analysis. A multivariate
analysis compares mean differences across several variables and takes formal account of their
correlation. In this way a small difference appearing in several related outcome variables may
result in a significant multivariate test, although no single outcome measure shows a significant
difference. This is not to say there is a power advantage in throwing 20 unrelated variables into a
multivariate analysis of variance, since a true difference in a single outcome measure can be
diluted in a joint test involving many variables that display no effect. However, if you are
interested in studying group differences in outcomes for which various measures exist (this
occurs in marketing, social science, medical, ecological and engineering studies), then MANOVA
probably carries greater statistical power. The scatterplot below displays two groups of
observations measured on three related outcome variables.

Figure 9.1 Two Groups Compared on Three Outcome Measures

Visually the two groups are distinct when viewed in the space defined by the three outcome
measures. It is also clear that the three measures are related: the shape from this viewing angle is
an ellipse. The group differences do not align perfectly along any single variable (x1, x2, x3), but
collectively the group differences are clear. We would expect a multivariate test of group
differences to be statistically significant. This provides an illustration of how a multivariate test
can provide greater statistical power when the outcome measures are related.

The second argument for running MANOVA in place of separate univariate (single outcome
variable) analyses concerns controlling the false positive rate when multiple tests are done. If a
separate ANOVA is run for every outcome measure, each tested at the .05 level, then the overall
(or experiment-wise) false positive rate (chance of obtaining one or more false positive test
results) is well above 5 in 100 (5%) because of the multiple tests. A MANOVA applied to the
outcome measures would result in a single test performed at the .05 level. Although there are
certainly alternative methods of controlling the false positive rate when multiple tests are
performed (for example, Bonferroni adjustments), using a multivariate test accomplishes this as
well. Some researchers follow the procedure of first performing a multivariate test, and only if
significant would they examine the individual univariate test results. This provides some control
over the false positive rate. It is not a perfect solution (it is similar to the argument for the LSD
multiple comparison procedure) and has received some criticism (see Huberty 1989).

MANOVA Assumptions
The assumptions made when performing multivariate analysis of variance are largely extensions
of those made under ordinary analysis of variance. In addition to the usual assumptions for a
linear model (additivity, independence between the error and model effects, independence of the
errors), MANOVA testing assumes that the residuals (errors) follow a multivariate normal
distribution in the population; this is a generalization of the normality assumption made in
ANOVA. In SPSS you can examine and test individual dependent variables for normality within
each group. This is not equivalent to testing for multivariate normality, but is still quite useful in
evaluating the assumption. In addition, homogeneity of variance, familiar from ANOVA, has a
multivariate extension concerning homogeneity of the within group variance-covariance matrices.
A multivariate test of homogeneity of variance (Box’s M test) is available to check this
assumption.

For large samples, we expect departures from normality to make little difference. This is due to
central limit theorem arguments combined with the fact that in MANOVA we are generally
testing simple functions of group means. If the samples are small and multivariate normality is
violated, the results of the analysis may be influenced. Data transformations (for example, logs)
on the dependent measure(s) may alleviate the problem, but have potential problems of their own
(interpretation, incomplete equivalence between tests in the transformed and untransformed
scales). Unfortunately, a general class of multivariate nonparametric tests is not currently
available; developments in this area would provide a solution.

Concerning homogeneity of variance, in practice if the sample size is similar across groups then
moderate departures from homogeneity of the within group variance/covariance matrices do not
affect the analysis. If homogeneity does not hold and the sample size varies substantially across
groups, then test results can be influenced. In the simplest scenarios, the direction of the effect
depends on which sized group has the larger variances, but specific situations can be far more
complex, in which case little can be generalized.

What to Look for in MANOVA


After investigating whether the assumptions are met, primary interest would be in the multivariate
statistical tests. If significant effects are found you might then examine univariate results or
request a dimension reduction analysis. Additionally, you might perform post hoc comparisons to
discover just where the differences reside.

SPSS Multivariate Procedure Differences


The Advanced Models module within SPSS includes a General Linear Model (GLM) procedure
in addition to the MANOVA procedure. Besides pivot table output, the GLM procedure has
several desirable features from the perspective of MANOVA: 1) Post hoc tests on marginal
means (univariate only), 2) Type 1 through Type 4 sums of squares available (greater flexibility
in handling unbalanced designs/missing cells), 3) Multiple Random Effect models can be easily
specified, 4) Residuals, predicted values and influence measures can be saved as new variables.
However, the MANOVA procedure contains several useful advanced functions: 1) Roy-
Bargmann step-down tests (testing for mean differences on a single dependent measure while
controlling for the other dependent measures); 2) Dimension reduction analysis and discriminant
coefficients. These latter functions provide information as to how the dependent variables
interrelate within the context of group differences (for a single main effect analysis, this is
equivalent to a descriptive discriminant analysis—see Chapter 2).

In short, while we expect the SPSS GLM procedure will be your first choice for multivariate
analysis of variance, the MANOVA procedure can contribute additional information. Please note
that MANOVA can only be run from syntax.
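To give a flavor of that syntax, a minimal sketch for the memory example introduced below might look like the following. It assumes the group factor is coded 1 through 3; see the Command Syntax Reference for the full set of MANOVA subcommands.

* Minimal sketch only; assumes group is coded 1, 2, 3.
MANOVA recall recog BY group(1,3)
  /PRINT=SIGNIF(MULTIV UNIV STEPDOWN)
  /DISCRIM=RAW STAN
  /DESIGN=group.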


Example: Memory Influences


Winer et al. (1991) describe an experiment looking at the memory effects of different
instructions. Three groups of human subjects learned nonsense syllables and were administered
two memory tests: recall and recognition. The first group of subjects was instructed to like or
dislike the syllables as they were presented (in an attempt to generate affect or emotion). A
second group was instructed that they would be tested on their knowledge of the syllables
(inducing anxiety?). The third group was told to count the syllables as they were presented
(interference). The experimenters were interested in assessing group differences in memory, with
memory being measured by both recall and recognition tests. We first perform the basic
multivariate analysis.

Click File…Open…Data
Move to the c:\Train\Advstat directory if necessary
Double-click on Memory.sav
Click Analyze…General Linear Model…Multivariate

We must specify the dependent measure(s) and at least one factor. The Multivariate dialog box
contains list boxes for the dependent variables, factors and covariates. The term “Fixed Factor(s)”
in the Multivariate dialog box reminds us that the factors are assumed to be fixed: levels of the
factor(s) used in the analysis were chosen by the researcher (not randomly sampled) and cover the
range to which population conclusions will be drawn. The Multivariate dialog box also permits a
weight variable to be incorporated in the analysis (performs weighted least squares). Although
rarely used in multivariate analyses (when used it is typically for univariate analyses), it adjusts
the analysis based on different levels of precision (or heterogeneity of variance) for different
individuals or groups.

Move recall and recog into the Dependent variable(s): list box
Move group into the Fixed Factor(s): list box

Figure 9.2 Multivariate Dialog


The Multivariate dialog box contains several buttons that represent useful features: a Plot button
that produces, for each dependent measure, a profile plot displaying group means; a Post Hoc
button that performs post hoc tests on the marginal means (for multivariate analyses, each
dependent variable is analyzed separately); and a Save button that permits you to save predicted values,
residuals and influence measures.

The Contrast button permits you to test planned comparisons (contrasts) of interest for any
between-group factor. You can use the Model button to specify the design (the factor main effects
and interactions that constitute the analysis) of your study. By default, a complete model, also
known as a saturated model (testing all main effects and interactions), is assumed, but those
performing analyses of incomplete designs (Latin squares, partial replicates, etc.) would use the
Model dialog box to indicate which model effects should be estimated. You can use the Options
dialog box to request means, homogeneity tests, correlation matrices and other diagnostic
information pertaining to the analysis. Here we will request means and a test of homogeneity;
keep in mind that since there is the same number of subjects in each level of group, the
homogeneity assumption is of less concern (and that is one reason why experiments try to obtain
equal numbers of subjects in each factor level).

Click the Options button


Move group to the Display Means for: list box
Click the Homogeneity tests check box

Figure 9.3 Multivariate: Options Dialog

Moving the group variable into the Display Means for list box will result in estimated means,
predicted from the chosen model, for the three groups. These means can differ from the observed
means if covariates are specified or if an incomplete model (not all main effects and interactions)
is used. If no covariates are present (our situation), then post hoc analyses can be applied to the
estimated marginal means using the Post Hoc button. The Compare main effects check box can be
used to have SPSS test for significant differences between every pair of estimated marginal means for each of the main effects (here only group) in the Display Means for list box. Note that
by default, a significance level of .05 (see the Significance level text box) is applied to each test;
this can be controlled by adjusting the value. In addition Bonferroni or Sidak adjustments can be
requested (Confidence interval adjustment drop-down list) to control the overall Type I error
(false positive rate) when many pairwise comparisons are made.

The Display area contains options that display supplemental information. We requested that
homogeneity of variance tests be performed. Spread vs. level plots display group variability as a
function of the central tendency of the group. As such they can be useful in assessing the source
of heterogeneity of variance. Residual Plots can be used to identify outliers. Checking
Descriptive statistics will cause means, standard deviations and counts to appear for each cell
(subgroup) in the analysis. Usually you would request these; however in our analysis the
estimated means are identical to the observed means, and we will request a table and plot of the
former. If Estimates of effect size is checked, then partial eta-square values will print for each
effect (main effects, interactions). Eta-square is equivalent to the r-square in regression; the
partial eta-square measures the proportion of variation in the dependent measure that can be
attributed to each effect in the model after adjusting for the other effects. Parameter estimates are
the estimates for the coefficients in the model. Typically, they would be requested if you wanted
to construct a prediction equation. The various sums of squares matrices are computational
summaries and not interpreted directly.

The Observed power check box will produce an analysis of the statistical power of your study to
detect effects of the magnitude observed in the sample.

The Significance level text box allows you to specify the significance level used to test for
differences in the estimated marginal means (default .05) and the confidence intervals around
parameter estimates (default .95).

Click Continue
Click the Model button

Figure 9.4 Multivariate: Model Dialog


For many analyses the Model dialog box is not used. This is because by default a full factorial
model (all main effects, interactions, covariates) is fit and the various effects tested using Type III
sums of squares (each effect is tested after statistically adjusting for all other effects in the
model). If there are any missing cells in your analysis, you might switch to Type IV sums of
squares, which better adjusts for them. If you are running specialized factorial designs that are
incomplete (by plan every possible combination of factor levels is not present), or in which there
are no replicates (interaction effects are used as error terms), you would click the Custom option
button in the Specify Model area and indicate which main effects and interactions to include in
the model. A custom model is sometimes used if there is no interest in testing high order
interaction effects. Since we have a single factor there is no need to modify this dialog.

Click Cancel
Click the Contrasts button

The Contrasts dialog is identical for both multivariate and univariate analyses. You would use it
to specify main effect group comparisons of interest, for which parameter estimates can be
obtained and tests performed. In the statistical literature, these contrasts are sometimes called
planned comparisons. For example, in an experiment in which there are three treatment groups
and a control group there is very specific interest in testing each experimental group against the
control. One of the contrast choices (Simple) allows this. Several types of contrasts are available
within the dialog box, and using syntax you can specify your own (Special). To request a set of
contrasts, select the factor from the Factor(s) list box, select the desired contrast from the Contrast
drop-down list, and click the Change button. By default, the procedure applies no contrasts; it
uses a (0,1) indicator-coding scheme. Since we have no specific planned contrasts that we wished
to apply to the Group variable, we will exit the Contrast dialog box.

Figure 9.5 Multivariate: Contrasts Dialog

Click Cancel
Click the Plots button
Move group into the Horizontal Axis: list box
Click the Add button


Figure 9.6 Multivariate: Profile Plots Dialog

The Profile Plots dialog box is available for the General Linear Model procedures and produces
profile plots as line charts displaying means at different factor levels. You can view main effects
with such plots, but they are most helpful in interpreting two and three-way interactions (note that
up to three factor variables can be included). The dependent variable(s) does not appear in this
dialog box; for multivariate analyses, separate plots are produced for each dependent measure.
Multiple plots can be requested, which is useful in complex analyses where there may be several
significant interactions. Although not necessary in this example, we requested a profile plot.

Click Continue
Click the Save button

Figure 9.7 Multivariate: Save Dialog


The Save dialog box allows you to save predicted values, and various types of residuals and
influence measures, as new variables in the data file. Examining them might identify outliers and
influential data points (data points whose exclusion substantially influences the analysis). Such
analyses are standard for serious practitioners of regression and can be applied in this context. In
addition, the coefficient statistics (coefficient estimates, standard errors, etc.) can be saved to an
SPSS data file (in matrix format) and can be manipulated later (for example, to apply coefficients
to generate predictions for future cases). Although we strongly recommend an examination of the
residuals, with the limited time available to us in this class, we skip this step.

Click Cancel

The Post Hoc button is used to request post hoc comparisons on the observed subgroup means.
Post hoc tests test for significant differences between every possible pairing of levels of a factor.
Since many tests may be involved, most post hoc methods adjust the significance criterion based
on the number of tests in order to control the false positive error rate (Type I error). Usually post
hocs are performed after a significant main effect is found (in the initial analysis), and we will
visit this dialog later in the chapter.

Click OK
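
For reference, pasting the choices made above produces GLM syntax along roughly these lines (a sketch;
the exact subcommands generated may vary slightly by version):

GLM recall recog BY group
/METHOD = SSTYPE(3)
/INTERCEPT = INCLUDE
/PLOT = PROFILE(group)
/PRINT = HOMOGENEITY
/EMMEANS = TABLES(group)
/CRITERIA = ALPHA(.05)
/DESIGN = group .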

Note:
Some of the output pivot tables have been edited (using the Pivot Table editor) to be more easily
readable.

Examining the Output


The first output describes the factors in the analysis. They are labeled between-subject factors;
this is appropriate because the three learning groups were composed of different individuals. We
will see within-subject analysis of variance (repeated measures) in the next chapter.

Figure 9.8 Between-Subject Factors Summary

There is one factor with three levels and there are 12 individuals per group (equal sample sizes).

Two of the next three pivot tables provide information about the homogeneity of variance
assumption. Box’s M tests for equality of covariance matrices (since there is more than a single
dependent measure) across the different groups. Levene’s homogeneity test is a univariate test
and is applied separately to each dependent measure.

Box’s M statistic generalizes the homogeneity of variance test to a multivariate situation, testing
for equality of group variances (as univariate homogeneity tests would) and covariances (which
univariate tests cannot) for all dependent measures in one test. Box’s test is not significant (.611),
indicating no group differences in the covariance matrices made up of the dependent measures.


As a univariate statistic, Levene’s test is applied to each dependent measure. The recall measure
is consistent with the homogeneity assumption (sig. = .263), while recognition (recog) does show
group differences in variance (sig. = .046). This is a modest departure from homogeneity. Given
that Box’s M is not significant and that the sample sizes are equal, we are not concerned about the
Levene result for recog. We will now view the multivariate significance tests.

Figure 9.9 Box’s M and Levene's Test of Homogeneity of Variance

Tests in multivariate ANOVA, just as in univariate ANOVA, are based on ratios of between-
group to within-group variation. However, in the multivariate case we are dividing two matrices
and there is no single number that represents the ratio of two matrices. As a result, several
multivariate tests have been developed, usually based on different aspects of the between-group
to within-group matrix ratio. There are four multivariate test statistics commonly applied: Pillai’s
criterion, Hotelling’s Trace criterion, Wilks’ Lambda, and Roy’s largest root. The first three give
identical results in a two-group analysis, but can differ in more complex analyses. They all test
the null hypothesis of no group mean differences in the population. Results of Monte Carlo
simulations focusing on robustness and statistical power suggest that, under general
circumstances, Pillai’s test is preferred. However, there are specific situations, for example when
the dependent measures are highly related (forming a strong core), that one of the others is the
most powerful test. As a general rule, if different multivariate tests give you markedly different
results, it suggests something about the dimensionality and type of group differences. For an
accessible discussion of this see Olsen (1976).


Figure 9.10 Multivariate Analysis of Variance Table

The upper part of the table tests whether the overall mean (Intercept) differs from 0 (zero) in the
population. It is not interesting since all it indicates is that people remember something. We will
focus on the tests of group differences.

The Value column displays the sample value of each of the four multivariate test statistics. They
are converted to F statistics (F column) and the associated hypothesis (Hypothesis df) and error
(Error df) degrees of freedom follow. These four columns are technical summaries; we are
primarily interested in the significance values that appear under the “Sig.” heading. Here we see
that every multivariate test indicates there are significant differences among the three groups. All
indicate that the probability of finding sample means (for recall and recognition) as far, or further
apart, as we observe here is very small (.001 or less than 1 in 1,000) if there are no population
differences. Given this we are next interested in looking at whether both recall and recognition
show differences (univariate tests), and which groups differ from which others.

We now examine the test results for each dependent measure.

Figure 9.11 Univariate Test Results

Although both dependent measures appear in this table, the results are calculated independently,
and are identical to what you would obtain if the analysis were run separately on each dependent measure (univariate ANOVA). Thus we find whether both of the dependent measures showed
significant group differences. The sums of squares, df (degrees of freedom), mean square, and F
columns are what we would expect in an ordinary ANOVA (analysis of variance) table. We
described and disregarded the Intercept information in the multivariate summary. Moving to the
Group section, we see (Sig. column) that both recall and recognition showed significant group
differences, but that the F statistic (and therefore the relative magnitude) of the differences is
more pronounced for recognition.

The Error section summarizes the within-group variation. The Corrected Model section pools
together all model effects (excluding the intercept), and in this simple analysis is identical to the
Group effect. The Total pools everything in the analysis (including the error), while the Corrected
Total pools everything except the intercept. It should be noted that if the sample sizes are not
equal when multiple factors are included in the analysis, then under Type III sums of squares (the
default), the sums of squares for the totals will not generally be equal to the sums of their
component sums of squares.

Finally, R-square values (based on the corrected model) for each variable appear as footnotes.
Notice that the R-square for recognition (.441) is higher than that of recall (.167). This is
consistent with recognition having a larger F statistic for the group differences.

Figure 9.12 Estimated Marginal Means

Estimated marginal means are means estimated for each level of a factor averaging across all
levels of other factors (marginals), based on the specified model (estimated). By default SPSS fits
a complete model (all main effects and interactions), and in such cases these estimated means are
identical to the observed means. However, if a partial model were fit (for example, if multiple
factors are analyzed but high order interactions not included) then the estimated marginal means
will differ from the observed means. In our analysis, the estimated marginal means equal the
observed means and we see that for both recall and recognition scores, the group told of the test
has the lowest means. Standard errors and 95% confidence bands for the estimated marginal
means appear as well.

We can see these means graphically in the profile plots. The profile plot for recall appears below.


Figure 9.13 Profile Plot of Recall (Estimated Marginal Means)

The Profile plot displays the estimated means (identical to observed means here) for the three
groups. Since we have but one factor this plot is not especially interesting. Profile plots are much
more useful when more factors are involved since they can represent up to three factors. The plot
for recognition is quite similar and is not shown.

Post Hoc Tests


At this point of the analysis it is natural to ask just which learning groups differ from which
others. The Multivariate (GLM) procedure in SPSS will perform separate post hoc tests on each
dependent variable in order to investigate this question. Post hoc tests are usually performed to
investigate which levels within a factor differ after an overall main effect has been established.
SPSS offers many post hoc tests, and the basic idea behind post hoc testing is that some
adjustment of the Type I (false positive or alpha) error rate must be done due to the number of
comparisons made. In our example, only three tests need be performed (group 1 vs. 2, 2 vs. 3, and
1 vs. 3). However, if there were ten levels of instruction then there would be 10 * 9/2, or 45
pairwise tests, and the probability of one or more false positive results would be quite substantial.

We will request that Scheffe post hoc tests be performed. Scheffe tests are considered
conservative in that when testing at the .05 level, you are assured that the overall false positive
rate, including any contrasts you may perform among the means (of which all possible pairwise
comparisons form a subset), does not exceed .05. To perform the post hocs we return to the
Multivariate dialog box.

Click the Dialog Recall tool, then click Multivariate


Click the Post Hoc button
Move group into the Post Hoc Tests for: list box
Click the Scheffe check box


Figure 9.14 Multivariate: Post Hoc Dialog

We have requested Scheffe post hoc tests for the group factor means. If you run a multifactor
study and find several main effects significant, post hoc tests can be applied to each factor.

Click Continue, then click OK
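
In syntax, this amounts to adding a POSTHOC subcommand to the GLM command run earlier; a sketch:

GLM recall recog BY group
/POSTHOC = group(SCHEFFE)
/EMMEANS = TABLES(group)
/PRINT = HOMOGENEITY
/PLOT = PROFILE(group)
/DESIGN = group .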


Figure 9.15 Scheffe Post Hoc Tests for Recall and Recognition

We see that for both recall and recognition, every possible group pairing appears. The Mean
Difference column shows the difference in sample means between the two groups, and the
Standard Error column contains the standard error of the difference between the means. The
standard errors are identical for all group differences involving recognition (and similarly for
those involving recall) because all groups have the same sample size and share the same error
term. The Sig. Column contains the significance value when the Scheffe test is applied to the
group differences. Finally a 95% confidence band, based on the Scheffe calculation, is displayed
for the difference in group means. There is some redundancy in the table; note that the “like or
dislike” group is paired with the “count syllables” group, and the “count syllables” group is in turn
paired with the “like or dislike” group.

Examining the table we see that for the recognition dependent variable the “told of test” group
had significantly lower scores than either of the other two groups. For recall, the “told of test”
group had significantly lower scores than the “like or dislike” group. Thus the results are not
identical for the two dependent measures. However, we can say that for both memory measures
the “told of test” group did significantly worse than the “like or dislike” group. Also there is some
evidence that the “told of test” group might differ from the “count syllables” group in memory.

In this way post hoc tests provide clarifying detail concerning a significant main effect.

Extensions
In this example we performed post hoc tests since we didn’t have specific group comparisons in
mind before analyzing the data. If you do have such planned comparisons, they can be performed
if you request the comparisons using the Contrast button. This will yield univariate planned
comparison results. If you require multivariate planned comparisons, they can be requested using
syntax on the GLM LMATRIX subcommand. The command below requests two planned
comparisons, comparing the first to the second group, and the second to the third group.


GLM
recall recog BY group
/METHOD = SSTYPE(3) /INTERCEPT = INCLUDE
/CRITERIA = ALPHA(.05)
/LMATRIX "Group 1 vs. 2" GROUP 1 -1 0
/LMATRIX "Group 2 vs. 3" GROUP 0 1 -1
/DESIGN .

Here we specify two planned contrasts of the Group factor means. The first (Group 1 vs. 2)
compares means for Group 1 and 2 (note the coefficients 1, –1 and 0 are assigned to the three
groups). The second (Group 2 vs. 3) compares Groups 2 and 3 (note the coefficients 0, 1 and –1
are assigned to the three groups). Note that if more than one between-group factor is involved the
LMATRIX subcommand can accommodate a more complex model, but that the contrast
specification in turn is more complex. The SPSS Syntax Guide, which can be installed along with
SPSS, contains more information on the LMATRIX subcommand.


Chapter 10
Repeated Measures Analysis of Variance

Topics
• Why Do a Repeated Measures Study?
• The Logic of Repeated Measures
• Assumptions
• Example: One Factor Drug Study
• Further Analyses
• Repeated Measures with Missing Data
• Appendix: Ad Viewing with Pre-Post Brand Ratings

Introduction
Repeated measures analysis of variance (ANOVA) involves testing for significant differences in
means when the same observation appears in multiple levels of a factor. Such analyses make
slightly different assumptions than between-group ANOVA and the error terms are calculated
differently. In this chapter we will discuss the reasons for performing repeated measures ANOVA
and the important assumptions. We will then run a simple analysis and examine the results in
order to illustrate the method. We then demonstrate repeated measures analysis when there are
missing values for a repeated measures factor. In the appendix we offer a more complex repeated
measures analysis.

Why Do a Repeated Measures Study?


Repeated Measures (also called within-subject) studies are used for several reasons. First, by
using a subject as her own control a more powerful (greater likelihood of finding a real
difference) analysis is possible. For example, consider testing a single subject with two drugs
compared to testing two individuals, each with only one of the drugs. By testing the same subject
twice instead of different people each time, the variability attributable to person-to-person
differences is reduced when comparing the two means, which should provide a more sensitive
analysis (of drug differences, in this case). A second reason in practice is cost reduction;
recruitment costs are less if an individual can contribute data to multiple conditions.

However, repeated measures studies have potential problems. Since an individual appears in
multiple conditions there may be practice, fatigue, or carryover effects. Counterbalancing the
order of conditions addresses the carryover problem, and the different trials or conditions are
often well spaced to reduce the practice and fatigue influences.
Examples of repeated measures analysis include:


1. Marketing – Compare customers’ ratings on four different brands or products, for example
four different perfume fragrances.
2. Medicine – Compare test results before, immediately after, and six months after a
procedure.
3. Education – Compare performance test scores before and after an intervention program.
4. Engineering – Compare output from different machines after running for 1 hour, 8 hours,
16 hours and 24 hours.
5. Agriculture – The original research area for which these methods were developed.
Different chemical treatments are applied to different areas within a plot of land (split-
plots) and crop yield results are compared.
6. Human Factors – Compare performance (reaction time, accuracy) under different
environmental conditions. For example, examine pilot accuracy in reading different types
of dials under varying lighting conditions.

For an accessible introduction to repeated measures analysis with a number of worked examples,
see Hand and Taylor (1987). For more technical and broad (beyond ANOVA) discussions of
repeated measures analysis see Lindsey (1993) or Crowder and Hand (1990).

The Logic of Repeated Measures


In the simplest case of repeated measures analysis two values are compared for each observation.
For example, suppose for each individual we record a physiological measure under two drug
conditions. We can obtain sample means for each drug condition and want to determine whether
there are significant differences between the drugs in the larger population. One direct way to
approach this would be to compute a difference or change score for each individual, obtained by
subtracting the two drug measures, and testing whether the mean difference score is different
from zero. We illustrate this in the spreadsheet below.

Figure 10.1 Difference Scores With Two Conditions


Subject    Drug 1    Drug 2    Difference
1            30        28          2
2            14        18         -4
3            24        20          4
4            38        34          4
5            26        28         -2
Means       26.40     25.60       0.80
S.D.                              3.63

We see a difference score is calculated for every individual and these scores are averaged
together. If there were no drug differences then we would expect the average difference score to
be about zero. To determine if the population mean difference score is different from zero, we
need some measure of the variability of sample mean difference scores. We can obtain such a
variability measure by calculating the variation of individual difference scores around the sample
mean difference score. If the sample mean difference score is so far from zero that it cannot be
accounted for by the variation of individual difference scores, we say there is a significant
population difference. This is what a paired t test does.
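
In SPSS, this two-condition comparison can be run directly as a paired t test. For example, if the two
measurements were stored as variables named drug1 and drug2 (the naming used in the data set analyzed
later in this chapter), the syntax would simply be:

T-TEST PAIRS = drug1 WITH drug2 (PAIRED).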

The analysis becomes a bit more complex when each subject (unit of analysis) appears in more
than two conditions, meaning there are more than two levels of a repeated measures factor. Now
no single difference score can summarize the differences. We illustrate this below.


Figure 10.2 Difference Scores with Four Conditions


Subject    Drug 1    Drug 2    Drug 3    Drug 4    Difference    Difference    Difference
                                                   1 versus 2    2 versus 3    3 versus 4
1            30        28        16        34          2             12           -18
2            14        18        10        22         -4              8           -12
3            24        20        18        30          4              2           -12
4            38        34        20        44          4             14           -24
5            26        28        14        30         -2             14           -16
Means                                                 0.80          10.00        -16.40
S.D.                                                  3.63           5.10          4.98

Although no one difference score summarizes all drug differences, we can compute additional
difference scores, and thus account for drug effects. The number of these differences, or contrasts,
is equal to the degrees of freedom (one less than the number of levels in the factor). For two
conditions, only one contrast is possible; for four conditions, there are three; for k conditions, k-1
contrasts are required. If assumptions of repeated measures ANOVA are met, these differences,
or contrasts between conditions, can be pooled to provide a significance test for an overall effect.

We used simple differences to compare the drug conditions (drug 1 minus drug 2, drug 2 minus
drug 3, etc.). There are many other contrasts that could be applied. For example, we could have
calculated drug 1 minus the mean of drugs 2, 3 and 4; and then drug 2 versus the means of 3 and
4; and finally drug 3 versus 4. As long as the assumptions of repeated measures are met, the
specific choice of contrasts doesn’t matter when the overall test is calculated. However, if you
have planned comparisons you want tested, then you would request those.

In each example, we wound up with one fewer difference variable than the original number of
conditions. Another variable is calculated in repeated measures, which represents the mean across
all conditions. It is used when testing effects of between-group factors, having averaged across all
levels of the repeated measure factor(s). This mean effect is shown in the figure below.

Figure 10.3 Mean Score Added to Difference Scores


Subject    Drug 1    Drug 2    Drug 3    Drug 4    Mean Across    Difference    Difference    Difference
                                                   Four Drugs     1 versus 2    2 versus 3    3 versus 4
1            30        28        16        34          27             2             12           -18
2            14        18        10        22          16            -4              8           -12
3            24        20        18        30          23             4              2           -12
4            38        34        20        44          34             4             14           -24
5            26        28        14        30          25            -2             14           -16
Means                                                                0.80          10.00        -16.40
S.D.                                                                 3.63           5.10          4.98

The mean score across drug conditions for each subject is recorded in the mean column. As
mentioned above, any tests involving only between-group factors (for example, sex, age
category) would use this variable.

This idea of computing difference scores or contrasts across conditions for each subject, then
using the means and subject-to-subject variation as the basis of testing whether the average
contrast value is different from zero in the population, is the core concept of repeated measures
ANOVA. Once you become comfortable with it, the rest falls into place. SPSS performs repeated
measures ANOVA by computing contrasts across the repeated measures factor's levels for each
subject, and then testing whether the means of the contrasts are significantly different from zero.
A matrix of coefficients detailing these contrasts can be displayed and is called the transformation
matrix.

Assumptions
Repeated Measures analysis of variance has several assumptions common to all ANOVA. First,
we assume the model is correctly specified and additive. Secondly, that the errors follow a normal
distribution and are independent of the effects in the model. This latter assumption implies
homogeneity of variance when more than a single group is involved. As with general ANOVA,
moderate departures from normality do not have a substantial effect on the analysis, especially if
the sample sizes are large and the shape of the distribution is similar from group to group (if
multiple groups are involved). In multi-group studies, failure of homogeneity of variance is a
problem unless the sample sizes are about equal.

In addition to the standard ANOVA assumptions, there is one specific to repeated measures when
there are more than two levels to a repeated measures factor. If a repeated measures factor
contains only two levels, there is only one difference variable that can be calculated, and you
need not be concerned about the assumption. However, if a repeated measures factor has more
than two levels, you generally want an overall test of differences (main effect). Pooling the results
of the contrasts (described above) between conditions creates the test statistic (F). The assumption
called sphericity tests when such pooling is appropriate. The basic idea is that if the results of two
or more contrasts (the sums of squares) are to be pooled, then they should be equally weighted
and uncorrelated. To illustrate why this is important, view the spreadsheet below.

Figure 10.4 Scale Differences and Redundancies in Contrasts


Subject    Drug 1    Drug 2    Drug 3    Drug 4    Diff 1        Diff 2        Diff 3
                                                   1 versus 2    100*(2-3)     1 versus 2
1            30        28        16        34          2           1200             2
2            14        18        10        22         -4            800            -4
3            24        20        18        30          4            200             4
4            38        34        20        44          4           1400             4
5            26        28        14        30         -2           1400            -2
Means                                                 0.80         1000.00          0.80
S.D.                                                  3.63          509.90          3.63

The first contrast variable represents the difference between drug 1 and drug 2 (Drug 1 – Drug 2).
However, the second is 100 times the difference between Drug 2 and Drug 3. It is clear from the
standard deviation values of the second difference variable that this variable would dominate the
other difference variables if the results were pooled. In order to protect against this, normalization
is applied to the coefficients used in creating the contrasts (each coefficient is divided by the
square root of the sum of the squared coefficients).

Also, notice that the third contrast is a duplicate of the first. Admittedly, this is an extreme
example, but it serves to make the point that since the results from each contrast are pooled
(summed), then any correlation among the contrast variables will yield incorrect test statistics. In
order to provide the best chance of uncorrelated contrast variables, the contrasts or
transformations are forced to be orthogonal (uncorrelated) before applying them to the data.

This combination of normalization and forcing the original contrasts to be orthogonal (uncorrelated)
is called orthonormalization. Again, when actually applied to the data, these
properties may not hold, and that is where the test of sphericity plays an important role.


This combination of assumptions, equal variances of the contrast variables and zero correlation
among them, is called the sphericity assumption. It is called sphericity because a sphere in
multidimensional space would be defined by an equal radius value along each perpendicular
(uncorrelated) axis. Although contrasts are chosen so that sphericity will be maintained, when
applied to a particular data set, sphericity may be violated. The variance-covariance matrix of a
group of contrast variables that maintain sphericity would exhibit the pattern shown below.

Figure 10.5 Covariance Matrix of Contrast Variables when Sphericity Holds


           Diff 1    Diff 2    Diff 3
Diff 1        V         0         0
Diff 2        0         V         0
Diff 3        0         0         V

The diagonal elements represent the variance of each contrast when applied to the data and the
off-diagonal elements are the covariances. If the sphericity assumption holds, the variances will
have the same value (represented by V) and the covariances will be zero.

A test of the sphericity assumption is available. If the sphericity assumption is met then the usual
F test (pooling the results from each contrast) is the most powerful test. When sphericity does not
hold, there are several choices available. Technical corrections (Greenhouse-Geisser, Huynh-
Feldt) can be made to the F tests (adjusting degrees of freedom) that modify the results based on
the degree of sphericity violation. An alternative is to take a multivariate approach, in which the
contrasts are tested simultaneously, while taking explicit account of the correlation and variance
differences. The difficulty in choosing is that no single method has been found (in Monte Carlo
studies) to be best under all conditions examined. Also, the test for sphericity itself is not all that
sensitive. For a summary of the various approaches and a suggested strategy for testing, see
Looney and Stanley (1989).

If a particular pattern is present in the error covariance matrix, the SPSS Mixed procedure (which
fits models containing mixtures of fixed and random factors) provides a way of formally
adjusting the test results. For example, if time is a repeated measures factor, there may be a first
order autocorrelation pattern in the errors, meaning that errors one time unit apart have correlation r,
errors two time units apart have correlation r², errors three time units apart have correlation r³, and
so on. In this situation, the analysis will fit a model that takes into account the specified pattern in
the error covariance matrix and results based on this will appear, in addition to goodness-of-fit
measures. A variety of covariance patterns can be specified and the fit statistics aid in deciding on
the best model.
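
As an illustration only (the data set and the variable names score, time and id here are hypothetical),
a repeated measures model with a first order autoregressive error covariance structure could be specified
in the Mixed procedure along these lines, assuming the data are in long form with one row per
subject-by-time combination:

MIXED score BY time
/FIXED = time
/REPEATED = time | SUBJECT(id) COVTYPE(AR1)
/METHOD = REML
/PRINT = SOLUTION TESTCOV.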

Example: One-Factor Drug Study


Our first example will be a sample data set from Winer et al. (1991) for a study in which five
subjects were tested in each of four drug conditions. Let’s assume the dependent measure is a
short-term memory test. Although the data set is small, which limits the statistical power of the
analysis, we want to determine whether there are differences among the drug conditions.

Click File…Open…Data
Move to the c:\Train\Advstat directory (if necessary)
Double-click on RepMeas1.sav


Figure 10.6 Data for Simple Repeated Measures Analysis

We have only one group of subjects. Each subject has a memory score under the four drug
conditions. Notice all four of the drug measures are variables attached to a single case. If the four
measures for a subject were spread throughout the file, the analysis could still be run within SPSS,
but only by using the Univariate dialog box (or the Mixed Models…Linear menu choice).

Click Analyze…General Linear Model…Repeated Measures

Figure 10.7 Repeated Measures Define Factor(s) Dialog

Here we provide names for any repeated measures factors and indicate how many levels each has.
Unlike a between-group factor, which would be a variable (for example, region), a repeated
measures factor is expressed as a set of variables.

Replace factor1 with drug in the Within-Subject Factor Name text box
Type 4 in the Number of Levels text box
Click the Add button


Figure 10.8 Defining a Repeated Measures (Within-Subject) Factor

We have defined one factor with four levels. In a more complex study (we will see one later in
this chapter) additional repeated measures factors can be added. The Measure area is used to
provide a label for the dependent measure in the results. Recall we named our four variables
drug1 to drug4 so there would be no ambiguity about which factor level each represented.
However, this choice of names doesn’t indicate that these variables all measure short-term
memory. You can supply such labeling information in the Measure area. We will do so here.

Type memory into the Measure Name text box


Click the Add button

Figure 10.9 Providing a Dependent Measure Name (Label)

Since we have only a single dependent measure (memory), we have simply provided a label for it
to be used in the output. If we had named more than one measure here, then we would be
prompted for more variables to represent the additional measures taken at each factor level. This
analysis, repeated measures with multiple dependent variables, is sometimes referred to as
doubly-multivariate analysis of variance.


Click the Define button

In this dialog we link the repeated measures factor levels to variable names, and declare any
between-subject factors and covariates. Notice the Within-Subjects Variables box lists drug as the
factor and provides four lines labeled with level numbers 1 through 4. We must match the proper
variable to each of the factor levels. This step should be done very carefully since incorrect
matching of names and levels can produce an incorrect analysis (especially if more than one
repeated measure factor is involved).

Figure 10.10 Repeated Measures Dialog

Move drug1, drug2, drug3, drug4 into the Within-Subjects Variables box, in that order
Click drug4 to select it

The variable corresponding to each drug condition is matched with the proper drug factor level.
Since drug4 is selected, the up-arrow button is active. These up- and down-arrow buttons will
move variables up and down the list, so you can easily make changes if the original ordering is
incorrect. We have neither between-subjects factors nor covariates and can proceed with the
analysis, but first let’s examine some of the available features.


Figure 10.11 Repeated Measures Factor Defined

By default a complete model (all factors and interactions) will be fit. As with procedures we saw
earlier in the course, a customized model can be fit for either between or within-subject factors.
This is usually done when specialty designs (Latin squares, incomplete designs) are run. The
Contrasts button is used to request that particular contrasts be applied to a factor (recall our
discussion of difference or contrast variables earlier). The Plots button generates profile plots that
graph factor level estimated marginal means for up to three factors at a time. Such plots are
powerful tools in understanding interaction effects. Since we have only a single factor we will not
request a plot. The Post Hoc dialog was discussed in Chapter 9, and it performs post hoc tests of
means for between-subject factors. We will look at the available test for repeated measures
factors shortly. The Save dialog allows you to save predicted values, various residuals and
influential point measures. Also, you can save the estimated coefficients to a file for later
manipulation (perhaps in a prediction model, or to compare results from different data sets). We
will explore the Option dialog more closely.

Click the Options button


Click the Descriptive statistics check box
Click the Transformation matrix check box


Figure 10.12 Repeated Measures: Options Dialog

Estimated marginal means (discussed in Chapter 9) can be produced for any factors in the model.
Since we are fitting a complete model, the estimated marginal means are identical to the
estimated means. We have asked to see the transformation matrix. The transformation matrix
contains the contrast coefficients that are applied to the repeated measures factor(s) in order to
create the difference or contrast variables used in the analysis. Here we display it only to
reinforce our earlier discussion of this topic. Diagnostic residual plots are available, and there is a
control to modify the confidence limits (default is 95%). The SSCP (sums of squares and cross
products) matrices are not ordinarily viewed. However, they do contain the sums of squares for
each of the contrast variables. By viewing them you can verify that the overall test simply sums
up the individual contrasts' sums of squares, which is why sphericity is necessary. The other
check box options were discussed in Chapter 9.

Click Continue
Click OK
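
The dialog choices above correspond to pasted syntax roughly like the following (a sketch; the
Transformation matrix request adds a further PRINT keyword that is omitted here):

GLM drug1 drug2 drug3 drug4
/WSFACTOR = drug 4 Polynomial
/MEASURE = memory
/METHOD = SSTYPE(3)
/PRINT = DESCRIPTIVE
/CRITERIA = ALPHA(.05)
/WSDESIGN = drug .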

Examining Results
The first summary displays information about the factors in the model.


Figure 10.13 Factor Summary

There is only a single within-subject (repeated measures) factor and no between-subjects factors.

Figure 10.14 Descriptive Statistics

Means, standard deviations, and sample sizes appear for each factor level. If you are unsure whether
you matched the variable names to the correct factor levels in the Repeated Measures Define Factor(s)
dialog box, you can compare these means to those you would obtain from the Descriptives or Means
procedures to ensure the proper variables are matched with the proper factor levels. The memory
scores for the third drug group seem substantially lower than for the others.

Multivariate test results appear next. Since they would typically be used only if the sphericity
assumption fails, we will skip these results for now and examine the sphericity test.

Figure 10.15 Mauchly’s Sphericity Test

We see from the significance (Sig.) column that the data are consistent with the sphericity
assumption. The significance value is above .05, indicating that the covariance matrix of the
orthonormalized transformation variables is consistent with sphericity (diagonal elements
identical and off-diagonal elements zero in the population). This is a very small sample and we certainly don’t have a powerful test of sphericity. Taking the result at face value, since sphericity
has been maintained we can use the standard (pooled) ANOVA results, and need not resort to
alternative (multivariate) or adjusted (called degree of freedom or epsilon adjustment;
alternatively models can be fit to specific error covariance structures in the Mixed procedure)
tests. The Epsilon section of the pivot table provides the degree of freedom modification factor
that should be applied if the sphericity result were significant. Let’s take a brief look at the
multivariate results.

Figure 10.16 Multivariate Tests

Remember these results need not be viewed since sphericity has been maintained. Here the test is
whether all of the contrast variables (representing drug differences in memory) are zero in the
population, while explicitly taking into account any correlation and variance differences in the
contrast variables. So if sphericity were violated these results could be used. Explanations about
the various multivariate tests were given in Chapter 9. The multivariate tests indicate there are
drug differences in memory.

Figure 10.17 Tests of Drug Effect

The Within-Subjects Effects table displays the standard repeated measures output based on
summing the results from each contrast (labeled "Sphericity Assumed") along with the sphericity-
corrected results. The test result is highly significant, more so than the multivariate test, which is
what we expect if sphericity holds: the pooled test is more powerful. We conclude there are
significant differences in memory across drug conditions.

If sphericity were violated we can see adjusted tests (degree of freedom/epsilon adjustment) in the
same table. As mentioned earlier, there are three variations. Although not necessary here, we
mention the Huynh-Feldt correction (Greenhouse-Geisser is overly conservative under some
conditions; see summary and references cited in Crowder and Hand (1990)).


The Huynh-Feldt corrected results do not visibly differ from the uncorrected results; this is
because there is very little departure from sphericity. If there were a substantial sphericity
violation, we would see differences.

Figure 10.18 Tests of Contrasts

Significance tests will be performed on each of the contrast variables used to construct a repeated
measures factor. Recall that by default, polynomial contrasts are used, and they, unfortunately,
have no relevance to our study. It makes no sense to examine linear, quadratic and cubic trend in
the context of four drug conditions. However, if the repeated measures factor were time or
dosage, then there would be interest in examining the contrast tests. Also, if there were planned
comparisons you wished to test, they can be requested through the Contrast dialog box, or in
syntax, and the results would appear in this section. To verify what these contrasts represent, we
can examine the transformation matrix (see below).

Figure 10.19 Tests of Between-Subject Effects

There were no between-subject factors in this study. If there were, the test results for them would
appear in this section. There is a test of the intercept, or grand mean; this simply tests whether the
average of all memory scores is zero. Although significant, it is not of interest.


Figure 10.20 Transformation Matrix

The transformation variables are split into two groups: one corresponding to the average across
the repeated measures factor, the others defining the repeated measures factor. The coefficients
for the Average variable are all .5, meaning each variable is weighted equally in creating the
Average transformation variable. If you wonder why the weights are not .25, recall that
normalization requires the sum of the squared weights to equal 1. Turning to the contrasts that
represent the drug effect, the three sets of coefficients are orthogonal polynomials corresponding
to linear, quadratic and cubic terms. Looking at the first (Linear) we see that there is a constant
increase (of about .447) in the value of each coefficient across the four drug levels. This
constitutes a linear trend. In a similar way, the second drug contrast (Quadratic) has two sign
changes (negative to positive, then positive to negative) over the drug levels; this constitutes a
quadratic effect. The SPSS Advanced Models manual has additional information about the
commonly used transformations.

Recall that the transformations are orthogonal; you can verify this for any pair by multiplying their
coefficients at each level of the factor and summing these products. The sum should be zero. For the
linear contrast (coefficients of roughly –.671, –.224, .224, .671) and the quadratic contrast
(coefficients of roughly .5, –.5, –.5, .5) we can calculate (–.671)(.5) + (–.224)(–.5) + (.224)(–.5) +
(.671)(.5), which is indeed zero.

Summary of Analysis
We wanted to test for drug differences in memory. After checking the sphericity assumption, we
viewed the standard repeated measures results and found significant differences in memory due to
drug condition. Our next step is to explore these differences more carefully. Also, in practice you
should examine the residuals from the analysis (request residual plots in Options dialog box); we
will not pursue this latter step.


Further Analyses
Having concluded there are memory differences due to drug condition, we now investigate which
conditions differ from which others. SPSS currently performs a limited set of post hoc tests on
repeated measures factors (LSD, Bonferroni, and Sidak). We will request all pairwise tests with
Bonferroni adjustments. To do this we return to the Options dialog box.

Click the Dialog Recall tool, then click Repeated Measures


Click the Define button
Click the Options button
Move drug into the Display Means for list box
Click the Compare Main Effects check box
Select Bonferroni from the Confidence interval adjustment drop-down list

Figure 10.21 Requesting Pairwise Tests with Bonferroni Adjustment

We have requested pairwise tests of all drug condition means. Note that the estimated marginal
means will be tested, but since we fit a complete model, these are identical to the observed means.
With the Bonferroni adjustment, we are assured the effect-wise error rate for tests will not exceed
.05.
Click Continue, then click OK
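
In syntax, the Bonferroni-adjusted pairwise comparisons correspond to an EMMEANS subcommand with the
COMPARE and ADJ keywords; a sketch:

GLM drug1 drug2 drug3 drug4
/WSFACTOR = drug 4 Polynomial
/MEASURE = memory
/EMMEANS = TABLES(drug) COMPARE ADJ(BONFERRONI)
/WSDESIGN = drug .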

Most of the output is identical to what we saw before. We will view only the results relevant to
the pairwise comparisons.


Figure 10.22 Transformation Matrix

Since we are comparing each condition to every other, a new transformation matrix is used to
create four variables each equivalent to a different drug condition: the identity transformation.
Notice this is a separate analysis, done after the others have been completed with the original
transformation matrix.

Also, note the estimated marginal means match the observed means (not shown).

Figure 10.23 Pairwise Test Results

Each drug condition mean is tested against every other. Examining the table, we see drug
condition 3 is significantly different from condition 4, and drug 4 is significantly different from
drug 1. Looking at the means it is surprising that drug 4 differs from more conditions than drug 3;
this is due to greater person-to-person variability in differences involving drug 3 (as evidenced by
the standard errors).

The program will also run a multivariate ANOVA attempting to test the pairwise comparisons
simultaneously; this is of no interest to us (even if there were sufficient degrees of freedom).


Planned Comparison
Suppose we had some specific hypothesis about drug conditions that we wished to test. For
example, if the drug conditions differed in the level of an active ingredient and we thought that at
some threshold level it would suddenly have an effect, we might want to test each adjacent pair of
means: drug1 versus drug2, drug2 versus drug3, and drug3 versus drug4. The Contrast button
provides a variety of planned comparisons and customized contrasts can be input using syntax.

Click the Dialog Recall tool, then click Repeated Measures


Click the Define button
Click the Contrasts button
Click the Change Contrast drop-down list and click Repeated
Click the Change button

Figure 10.24 Requesting Planned Comparisons

Repeated contrasts will compare each category to the one adjacent. After selecting any contrast
on the Contrast drop-down list, right-click on the contrast to obtain a brief description of it. The
SPSS Advanced Models manual contains more details. Many contrast types can be requested from
the Contrasts dialog box. Also be aware that you can provide custom contrasts using the Special
keyword in syntax.
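
As a sketch, requesting the Repeated contrast simply replaces the default Polynomial keyword on the
WSFACTOR subcommand in the pasted syntax; the comment notes where a custom (Special) coefficient matrix
would be supplied instead.

* A custom contrast would instead use /WSFACTOR = drug 4 SPECIAL with a user-supplied coefficient matrix.
GLM drug1 drug2 drug3 drug4
/WSFACTOR = drug 4 Repeated
/MEASURE = memory
/WSDESIGN = drug .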

Click Continue, and then click OK

Again, most of the output is identical to the previous runs; we focus on the contrast tests and the
transformation matrix.

Figure 10.25 Tests of Contrasts


We see that the second and third contrasts are significant at the .05 level. The second (Level 2 vs.
Level 3) compares drug2 to drug3, and the third compares drug3 to drug4. This seems
inconsistent with the pairwise tests we just ran in which drug2 was not different from drug3.
However, recall that we performed Bonferroni corrections on those tests and with these
adjustments the second contrast (here with significance level .012) was not significant.

To confirm our understanding of the contrasts, we view the transformation matrix.

Figure 10.26 Transformation Matrix

We see the contrasts do compare each drug condition to the adjacent one. The transformation
matrix is very useful in understanding and verifying which contrasts are being performed. These
contrasts are not orthogonal, and would not be used without modification (orthonormalization) in
the sphericity and pooled significance tests appearing earlier.

Repeated Measures with Missing Data


In our previous repeated measures analyses, if a subject had a missing value on any of the drug
variables, then that subject would be excluded from the analysis (listwise deletion). It would be
preferable to make use of the available information from each subject, and the SPSS Mixed
procedure will automatically do so. The Mixed procedure can do far more than repeated measures
analysis with missing values. It can perform some forms of hierarchical linear model (HLM)
analysis, run models with mixtures of random and fixed effects, and can model different error
covariance patterns in repeated measures analysis. In this example, we will use it to demonstrate
how to set up repeated measures ANOVA with missing values for a repeated measures factor, but
will not discuss the options that permit the other mentioned analyses.

Click File…Open…Data
Double-click on RepMeas2.sav


Figure 10.27 Data Setup for Repeated Measures with Missing Data

The data used in the previous examples have been restructured so there is one record per subject by
drug combination. In addition, notice that subject (id) 1 has a missing memory value for the drug
4 condition. When repeated measures data are organized this way (which is necessary if SPSS is
to use the data from subjects with missing values), we must specify subject (here id) as a factor in
the model since the case-to-case variation does not represent subject-to-subject variation.
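
If your own data are in the wide layout of RepMeas1.sav, a restructure along the following lines produces
this long layout. This is a sketch: it assumes no subject identifier exists yet and creates one called id
from the case number ($CASENUM).

* Create a subject identifier, then restructure to one row per subject-drug combination.
COMPUTE id = $CASENUM.
EXECUTE.
VARSTOCASES
/MAKE memory FROM drug1 drug2 drug3 drug4
/INDEX = drug(4)
/KEEP = id.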

Click Analyze…Mixed Models…Linear

Figure 10.28 Linear Mixed Models: Specify Subjects and Repeated Variables Dialog

We won’t need this dialog for our analysis. It would be used to specify random factors for
hierarchical linear models (HLM) and repeated measures analysis with complex error covariance
structures (available on Repeated Covariance Type drop-down list).

Click Continue


Move memory into the Dependent Variable list box


Move drug and id into the Factor(s) list box

Figure 10.29 Linear Mixed Models Dialog

Subject (here id) is specified as a factor. Fixed (here drug) and random (here id) factors can be
declared by clicking the Fixed and Random buttons.

Click the Fixed button


Select drug, then click the Add button

Figure 10.30 Linear Mixed Models: Fixed Effects Dialog

In addition to declaring the fixed factors, model information (interactions, nesting) can be
supplied in this dialog. Drug is the single fixed effect in this study.


Click Continue
Click the Random button
Select id, then click the Add button

Figure 10.31 Linear Mixed Models: Random Effects Dialog

Here we simply declare id as a random factor; unlike in the GLM procedure, at least one random
factor must be explicitly declared in the Mixed procedure. In addition to random factor
declaration, random effects can be added to the model and a covariance pattern specified for the
random effects (for example, in a repeated measures or HLM analysis).

Click Continue
Click the Statistics button
Click the Descriptive Statistics check box (not shown)
Click Continue, then click OK

We requested that descriptive statistics be displayed; most of the options in the Mixed Models:
Statistics dialog box (not shown) concern displaying estimates for more complex mixed models
(for example, when a random factor is assumed to have a specific covariance pattern).
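
For reference, the dialog specifications above correspond to Mixed syntax along these lines
(REML is the procedure’s default estimation method and is shown only for clarity):

* Sketch of the specifications made in the dialogs above.
MIXED memory BY drug id
  /FIXED=drug
  /RANDOM=id
  /METHOD=REML
  /PRINT=DESCRIPTIVES.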


Figure 10.32 Descriptive Statistics with Missing Values (End of Table)

The bottom of the descriptive statistics summary table shows that subject 1 has only three
observations, not four; these three data values will be used in the analysis.

We will skip several of the other summaries (Model Dimension, Information Criteria (goodness-
of-fit)) and move directly to the significance tests for drug.

Scroll to the Type III Tests of Fixed Effects pivot table

Figure 10.33 Test of Drug Effect

Not surprisingly, the results are very similar to the original analysis with complete data (Figures
10.17 and 10.19). The reduced denominator degrees of freedom for the Drug effect (about 11)
reflects the fact that an observation is missing (the error degrees of freedom in Figure 10.17 were
12). In this way repeated measures analysis can be done in the presence of missing values,
although we make the important assumption that the values are missing at random.

Appendix: Ad Viewing with Pre-Post Brand Ratings


A second example involves a more complex analysis but will be done with fewer variations. A
marketing experiment was devised to evaluate whether viewing a commercial produces improved
ratings for a specific brand. Ratings on three brands (on a 1 to 10 scale, where 10 is the highest
rating) were obtained from subjects before and after viewing the commercial. Since the hope was
that the commercial would improve ratings of only one brand (A), researchers expected a
significant brand by pre-post commercial interaction (only brand A ratings would change). In
addition, there were two between-subjects factors: sex and brand used by the subject. Thus the
study had four factors overall: sex, brand used, brand rated, and pre-post commercial. We view
the data below.


Click File…Open…Data
Double-click on Brand.sav

Figure 10.34 Data from Brand Study

Sex and user are the between-subjects factors. The next six variables pre_a to post_c contain the
three brand ratings before and after viewing the commercial.

Click Analyze…General Linear Model…Repeated Measures


Replace factor1 with prepost in the Within-Subject Factor Name text box
Type 2 in the Number of Levels text box
Click the Add button
Type brand in the Within-Subject Factor Name text box
Type 3 in the Number of Levels text box
Click the Add button
Type rating in Measure Name text box
Click the Add button in the Measure Name area


Figure 10.35 Two Within-Subject Factors Declared

SPSS now expects variables that comprise two within-subject factors. The order in which you
name the factors matters only in that SPSS orders the factor-level list in the next dialog so that the
last factor named here has its levels changing most rapidly. Thus, depending on how your
variables are ordered in the data file, some factor orders make the later declarations easier.

Both prepost and brand are listed as within-subject factors. There are six rows, so each possible
combination of levels of the two factors is represented. Note that the brand level changes most
rapidly going down the list; this is because brand was named last when the factors were defined.
Defining the factors in an order consistent with the order of the variables in your data file makes
these declarations easier.

Click the Define button


Move pre_a, pre_b, pre_c into the first three positions in the Within-Subjects
Variables list box
Move post_a, post_b, post_c into the last three positions in the Within-Subjects
Variables list box
Move sex and user into the Between-Subjects Factor(s) list box


Figure 10.36 Between-Subjects and Within-Subjects Factors Defined

We can proceed with the analysis, but first let’s request some options.

Click the Options button.


Click the Descriptive statistics check box
Individually move sex, user, prepost and brand into the Display Means for list box
Click the Homogeneity tests check box

Besides descriptive statistics, we request estimated marginal means (which equal the observed
means since we are fitting a full model) for each of the factors. Since there are several groups
involved in the analysis, we ask for homogeneity of variance tests.


Figure 10.37 Repeated Measures: Options Dialog

We proceed with the analysis. Contrasts can be applied to any factors in the same way as we saw
earlier in this chapter.

Click Continue, then click OK
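
For reference, the completed dialogs correspond to GLM syntax along the following lines. Note
that the within-subjects variables are listed in an order matching the factor definitions, with brand
(named last) changing most rapidly; Polynomial is simply the default contrast.

* Sketch of the specifications made in the dialogs above.
GLM pre_a pre_b pre_c post_a post_b post_c BY sex user
  /WSFACTOR=prepost 2 Polynomial brand 3 Polynomial
  /MEASURE=rating
  /PRINT=DESCRIPTIVE HOMOGENEITY
  /EMMEANS=TABLES(sex)
  /EMMEANS=TABLES(user)
  /EMMEANS=TABLES(prepost)
  /EMMEANS=TABLES(brand)
  /WSDESIGN=prepost brand prepost*brand
  /DESIGN=sex user sex*user.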

Examining Results
The factors in the analysis are listed along with the sample sizes for the between-subjects factors.

Figure 10.38 Factors in the Analysis


Figure 10.39 Descriptive Statistics -Beginning

Subgroup means appear separately for each of the repeated measures variables. Means for the
repeated measures factors can be seen in the estimated marginal means pivot tables, or viewed in
profile plots.

Tests of Assumptions
Although they don’t appear together in the output, we will first examine some assumptions of the
analysis. Concerning homogeneity of variance, the program provides Box’s M statistic and
Levene’s test. Box’s M is a multivariate homogeneity of variance statistic testing here whether
the variance-covariance matrices composed of the six repeated measures variables are equal
across the between-subject factor subgroup populations. Levene’s test is univariate and tests each
of the six repeated measures variables separately.

Figure 10.40 Box’s M Test of Homogeneity

Box’s M test is not significant, indicating that the data are consistent with the hypothesis of
homogeneity of covariance matrices (based on the six repeated measures variables) across the
population subgroups.


Figure 10.41 Levene’s Test of Homogeneity

Not surprisingly, the results of Levene’s test are consistent with Box’s M. Box’s M test has the
advantage of being a single multivariate test. However, Box’s test is sensitive to both
homogeneity and normality violations, while Levene’s test is relatively insensitive to lack of
normality. Since homogeneity of variance violations are generally more problematic for
ANOVA, Levene’s test is useful.

Now let’s examine the sphericity assumption, since this determines whether we simply view the
pooled ANOVA results or move to the multivariate or degrees-of-freedom (epsilon) adjusted
results (and, as noted earlier, the Mixed procedure allows various error covariance structures to be
modeled).

Figure 10.42 Sphericity Tests

Notice that no sphericity test is applied to the prepost factor. This is because it has only two
levels, so only one difference variable is created, and there is no pooling of effects. The sphericity
test for brand is not significant (Sig. = .832), nor is the sphericity test for the brand by prepost
interaction (Sig. = .975). Thus the data are consistent with sphericity. As a result we will not view
the multivariate test results or the adjusted pooled results (Huynh-Feldt, etc.), and instead focus
on the standard results. If sphericity were violated, the Mixed procedure also permits various
error covariance structures to be fit to the data.


Recall that sphericity-adjusted results will appear in the "Tests of Within-Subjects Effects" table
and notice this table is quite large. Since the sphericity assumption has been met, we will edit the
"Tests of Within-Subjects Effects" table so it displays only the "Sphericity Assumed" results.

Double-click on the Tests of Within-Subjects Effects pivot table to invoke the Pivot
Table editor
Click Pivot…Pivot Trays (so the option is checked)
Click and drag the second row tray icon (pop-up label "Epsilon Corrections") from the
Row tray to the Layer tray
If necessary, click the right-arrow of the Epsilon Corrections tray icon until
"Sphericity Assumed" appears in the Layer label
Click outside the crosshatch border around the table to close the Pivot Table editor

This table contains all tests that involve within-subject factors; those involving only between-
subjects effects appear later. Looking at the significance (Sig.) column, we see a highly
significant prepost commercial effect and a significant brand by user interaction. The prepost
commercial by brand interaction is not significant, indicating that although the commercial may
have shifted ratings (prepost commercial is significant), it did not differentially improve the
rating of brand A, which was the aim of the commercial. We will view the means and profile
plots to understand the significant effects.

Figure 10.43 Within-Subject Tests

Note
We will view neither the multivariate results nor the degrees-of-freedom (epsilon) corrected
results, which would be appropriate if sphericity were violated. Nor will we examine the tests of
specific contrasts, since we had no planned contrasts and polynomial contrasts across brand
categories make no conceptual sense.


Figure 10.44 Between-Subject Tests

Of the between-subjects effects, only sex shows a significant difference. Let’s take a look at some
of the means.

Figure 10.45 Means for Prepost Commercial and Sex

We see that males give higher ratings than females and that the post-commercial ratings are
higher than the pre-commercial ratings. It seems that the commercial was a success, but a success
for all brands, not just brand A, as hoped. The means for brand used and for brand rated are
similar across the three groups (not shown) and do not differ according to our statistical tests.

Profile Plots
To better view the interaction between user (brand used) and brand we will request a profile plot.

Click the Dialog Recall tool, then click Repeated Measures


Click the Define button
Click the Plots button
Move user into the Horizontal Axis list box
Move brand into the Separate Lines list box
Click the Add button


Figure 10.46 Requesting a Profile Plot

As many as three factors can be displayed in a profile plot, so a three-way interaction can be
investigated. Notice that multiple profile plots can be requested, which allows for many views of
your data.

Click Continue, then click OK
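
In syntax, the profile plot corresponds to a PLOT subcommand added to the GLM command
sketched earlier; within PROFILE, the horizontal-axis factor is listed first and the separate-lines
factor second. A sketch:

* Sketch only; the print and estimated marginal means subcommands shown
earlier are omitted for brevity.
GLM pre_a pre_b pre_c post_a post_b post_c BY sex user
  /WSFACTOR=prepost 2 Polynomial brand 3 Polynomial
  /MEASURE=rating
  /PLOT=PROFILE(user*brand)
  /WSDESIGN=prepost brand prepost*brand
  /DESIGN=sex user sex*user.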

Figure 10.47 Profile Plot of Brand Used by Brand Ratings

In the plot, brand levels 1, 2 and 3 correspond to brands A, B and C, respectively. The Brand used
by Brand interaction shows (as we surely would expect) that those who regularly use a particular
brand rate it more highly than the other brands. This effect is statistically significant (see the
brand*user term in Figure 10.43). Especially when there are many factor levels, or several factors
are involved, profile plots can be very helpful.

Note
To do further investigation, request profile plots for other factors and relate them to the results
above.


Summary of Analysis
The homogeneity and sphericity assumptions were met. We did not examine normality, but could
do so by requesting residual plots in the Options dialog box. We found that men gave higher
brand ratings than women, that post-commercial ratings were higher than pre-commercial ratings,
and that respondents rated their own brand highest. The desired brand by prepost commercial
interaction was not evident.

Extensions
The principles of this chapter and Chapter 9 are combined in what is called doubly multivariate
analysis of variance. In these analyses there can be one or more repeated measures factors and
several dependent measures taken. Such analyses occur in sensory testing (several wines rated on
various characteristics), marketing (several brands rated on related attributes), medicine (several
related physiological measures taken under each condition), and longitudinal studies (several
family related attitude measures recorded every six months).


References
Agresti, A. 2002. Categorical Data Analysis, 2nd ed. New York: Wiley.

Agresti, A. 1996. An Introduction to Categorical Data Analysis. New York: Wiley.

Aldenderfer, M. S. and R. K. Blashfield. 1984. Cluster Analysis. Thousand Oaks, CA: Sage.

Aldrich, J. H. and F. D. Nelson. 1984. Linear Probability, Logit, and Probit Models. Thousand
Oaks, CA: Sage.

Anderberg, M. R. 1973. Cluster Analysis for Applications. New York: Academic Press.

Andrews, F. M., Klem, L., Davidson, T. N., O’Malley, P. M., and W. L. Rodgers. 1981. A Guide
for Selecting Statistical Techniques for Analyzing Social Science Data. Ann Arbor: Institute for
Social Research, University of Michigan.

Bishop, Y. V. V., Fienberg, S. E., and P. W. Holland. 1975. Discrete Multivariate Analysis. New
York: Wiley.

Chiu, T., Fang, D., Chen, J., Wang, Y., and Jeris, C. 2001. “A Robust and Scalable Clustering
Algorithm for Mixed Type Attributes in Large Database Environment.” In Proceedings of the
Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 263.

Crowder, M. J. and D. J. Hand. 1990. Analysis of Repeated Measures. London: Chapman and
Hall.

Draper, N. R. and H. Smith. 1998. Applied Regression Analysis, 3rd ed. New York: Wiley.

Everitt, B. S., S. Landau, and M. Leese. 2001. Cluster Analysis, 4th ed. London: Hodder
Arnold.

Fienberg, S. E. 1980. The Analysis of Cross-Classified Data. Oxford: Oxford University.

Green, P. E., F. J. Carmone, and S. M. Smith. 1989. Multidimensional Scaling: Concepts and
Applications. Boston: Allyn and Bacon.

Haberman, S. J. 1982. “Analysis of Dispersion of Multinomial Responses.” Journal of the
American Statistical Association. 77: 568-580.

Hand, D. J. 1981. Discrimination and Classification. New York: Wiley.

Hand, D. J., Daly, F., Lunn, A. D., McConway, K. J. and E. Ostrowski. 1994. A Handbook of
Small Data Sets. London: Chapman & Hall.

Hand, D. J. and C. C. Taylor. 1987. Multivariate Analysis of Variance and Repeated Measures.
London: Chapman and Hall.


Harman, H. H. 1967. Modern Factor Analysis, 2nd ed. Chicago: University of Chicago.

Hosmer, David W. and Stanley Lemeshow. 2000. Applied Logistic Regression, 2nd ed. New
York: Wiley.

Huberty, Carl J. 1989. “Multivariate Analysis versus Multiple Univariate Analyses.”
Psychological Bulletin. 105, 302-308.

Huberty, Carl J. 1994. Applied Discriminant Analysis. New York: Wiley.

Kaiser, H.F. 1974. “An Index of Factorial Simplicity.” Psychometrika, 39, 31-36.

Kim, J. O. and C. W. Mueller. 1978. Introduction to Factor Analysis. Thousand Oaks, CA: Sage.

Klecka, W. R. 1980. Discriminant Analysis. Quantitative Applications Series. Thousand Oaks,
CA: Sage.

Klein, J. P. and M. L. Moeschberger. 1997. Survival Analysis: Techniques for Censored and
Truncated Data. New York: Springer Verlag.

Kleinbaum, David G. 1994. Logistic Regression. New York: Springer-Verlag.

Kleinbaum, David G. 1996. Survival Analysis. New York: Springer-Verlag.

Klockars, Alan J. and G. Sax. 1986. Multiple Comparisons. Newbury Park, CA: Sage.

Lachenbruch, P. A. 1975. Discriminant Analysis. New York: Hafner.

Lee, E. T. 1992. Statistical Methods for Survival Analysis. New York: Wiley.

Lindsey, J. K. 1993. Models for Repeated Measurements. Oxford: Clarendon Press.

Looney, S. W. and W. Stanley. 1989. “Exploratory Repeated Measures Analysis for Two or More
Groups.” The American Statistician. Vol. 43, No. 4, 220-225.

McCullagh, P. and J. A. Nelder. 1989. Generalized Linear Models, 2nd ed. London: Chapman &
Hall.

McLachlan, G. J. 1992. Discriminant Analysis and Statistical Pattern Recognition. New York:
Wiley.

Menard, Scott. 1995. Applied Logistic Regression Analysis. Quantitative Applications Series.
Thousand Oaks, CA: Sage.

Milliken, George A. and Dallas E. Johnson. 1984. Analysis of Messy Data, Volume 1: Designed
Experiments. New York: Van Nostrand Reinhold.

Olson, C. L. 1976. “On Choosing a Test Statistic in Multivariate Analysis of Variance.”
Psychological Bulletin. 83, 579-586.


Punj, G. and D. W. Stewart. 1983. “Cluster Analysis in Marketing Research: Review and
Suggestions for Applications.” Journal of Marketing Research. Vol. 20 (also reprinted in Green,
Carmone & Smith).

Stouffer, S. A., Suchman, E. A., Devinney, L. C., Star, S. A. and R. M. Williams, Jr. 1949. The
American Soldier: Adjustment During Army Life. Studies in the Social Psychology in World War
II, Vol. 1. Princeton: Princeton University.

Toothaker, Larry E. 1991. Multiple Comparisons for Researchers. Newbury Park, CA: Sage.

Tukey, John W. 1991. “The Philosophy of Multiple Comparisons.” Statistical Science. 6, 100-
116.

Winer, B. J., Michels, Keith M. and D. R. Brown. 1991. Statistical Principles in Experimental
Design (3rd ed.). New York: McGraw-Hill.

Zhang, T., Ramakrishnan, R., and M. Livny. 1996. “BIRCH: An Efficient Data Clustering Method
for Very Large Databases.” Proceedings of the ACM SIGMOD Conference on Management of
Data. 103-114. Montreal, Canada.


Exercises
Chapter 2: Discriminant
1. Use the customer data (CSM.sav) and a stepwise method to predict buyyes using the same
set of independent variables used in the chapter. Compare your results to those we
obtained using forced entry.
2. Try to predict categories of the variable satisf_1, which measures overall customer
satisfaction on a four–point scale. How many classification functions are significant?
What variables are important predictors?
3. Use the Gss93.sav results and predict category membership in smkdrnk for two or three
cases in the file, using the Fisher coefficients. Compare your predicted values to those
obtained by SPSS to check your work.

Chapter 3: Logistic Regression


1. Using the Gss93.sav file, try to predict whether or not someone has given money to an
environmental group (grnmoney; reverse the coding of this variable to make it easier to
interpret). When you have a significant model, make sure you can interpret the effect of
the coefficients.
2. Using your model from step 1, try to predict whether or not the respondent gave money
for two or three cases in the file. Make sure your result matches that of SPSS.
3. Remove outlying or influential cases from the data for the model predicting low birth
weight (Birthwt.sav). Rerun the model and compare to the results in the chapter for any
differences. Which model would you prefer? Why?

Chapter 4: Multinomial Logistic Regression


In this exercise you will use the SPSS data file Gss94.sav. Respondents were asked whom they
voted for in the 1992 presidential election, and that will be the dependent variable.
1. Open the file Gss94.sav. Since we are interested in the top three candidates, open and run
the syntax stored in Exer4.sps. It recodes the presidential voting variable (pres92) so that
only the three main candidates (Clinton, Bush, and Perot) have non-missing values.
2. Run a multinomial logistic model with gender (sex) as the only factor. Examine the
results and try to interpret the coefficients.
3. For those with extra time: Run a crosstabs table of pres92 and sex. Using the calculator
utility on your computer, see if you can reproduce the ratios of the voting counts for the
candidate pairings for the two genders using the model coefficient (Exp(B)).
4. Run a multinomial logistic model with sex and degree as Factors to predict pres92.
Examine the results (significance tests, coefficients, and classification table). Which
effects are significant? Try to interpret the coefficients (Exp(B)). Evaluate the predictive
accuracy.
5. Add race as a factor and age as a covariate. Are they significant? If so, try to interpret the
coefficients. Compare the pseudo r-square and classification tables to those in step 4.
6. Examine additional demographic or attitudinal variables (Hint: use Utilities…Variables
or Utilities…File Info to obtain information about the variables) that you feel might be
related to voting preference. Use those you select in a new model with the significant


variables from step 4. If the new variables you added are significant, compare the model
results to those obtained previously.

Chapter 5: Survival Analysis


1. The SPSS data file Ishmael.sav (appearing in Lee (1992), provided by Dr. Richard
Ishmael) contains survival data from a medical study in which resected melanoma cancer
patients were given one of two immunotherapy treatments (treat): Bacillus Calmette-
Guerin (BCG) or Corynebacterium parvum (C. parvum). Survival time was recorded in
months (time) and some cases were censored (status). The status variable indicates if
death occurred (1), or the case was censored (2). Perform a survival analysis and test for
survival differences between treatment groups.
2. In addition, sex and initial stage of illness (stage) were recorded. Use sex as a strata
variable in a Kaplan-Meier analysis. Do you obtain results consistent with your earlier
conclusions?
3. For those with extra time. Run Cox regression, including treatment (treat), gender, stage
and age in the analysis (note that stage can be declared as a categorical variable). Which
variables are significantly related to survival?

Chapter 6: Cluster Analysis


1. Open the SPSS data file Usage.sav (used in the K-Means example in Chapter 6). Use K-
means clustering to run a four-cluster solution. Produce a profile plot. Given the class
discussion of the original solution, can you interpret the four-cluster solution?
2. Run a Two-Step cluster analysis on the same data, treating the usage variables as
categorical cluster variables. How many clusters were automatically selected? Interpret
the clusters and compare them to other solutions (in Chapter 6 and in step 1).
3. For those with extra time: Using the same data run a three-cluster solution using Ward’s
method (Hierarchical Clustering). Interpret the three clusters. Compare the three-cluster
Ward’s solution to the three-cluster K-means solution in Chapter 6 and the TwoStep
cluster results.

Chapter 7: Factor Analysis


1. Use the data file Gss94.sav. There is a set of questions asking whether the respondent
would favor or oppose abortion in each of seven circumstances (for any reason, birth
defects, threat to mother’s health, etc.). Each of these variables begins with the prefix
“ab.” Although these items don’t form a full interval scale (two rank-ordered choices),
we are interested in investigating whether there is a single factor or attitude towards
abortion, or whether attitudes are more complicated. Run factor or principal components
analysis on this set of items with a varimax rotation. How many factors are required? Can
you devise names for them (or it)?
2. For those with extra time: Rerun the analysis using a different factor method. Are the
results from the two methods consistent? Try a different rotation; is the interpretation any
different? Create factor score variables and examine the means of the factor scores
broken down by region of the country (region). Are there any interesting patterns?


Chapter 8: Loglinear Models


1. Use the 1994 General Social Survey file. Use Model Selection to run a loglinear analysis
of sex, whether the respondent smokes (smoke) and whether the respondent drinks
(drink). Where are the significant relationships and can you quantify any of them? Which
effects does Backward Elimination drop from the model?
2. For those with extra time: Run the model derived from step 1 using Genlog in either
loglinear or logit form (depending on how you view the relationships among the
variables, i.e., is there a dependent variable). Does your model fit the data? Add an
additional variable of your choice to the analysis. Does it influence the results?

Chapter 9: Multivariate Analysis of Variance


1. The goal is to examine the relationship between education, handgun ownership and
region. The 1994 General Social Survey contains a question about the presence of a
handgun (pistol) in the house. We hope to relate this variable and region of the country
(region) to “household formal education.” One problem with the handgun question is that
it isn’t known just who owns the gun. Since the handgun is linked to the household, but
not an individual, we will attempt to tap “household education” by analyzing both
respondent’s education (educ) and spouse’s education (speduc) simultaneously. Thus run
a two-factor multivariate analysis of variance with region and handgun ownership (pistol)
as the factors and the two education measures as the dependent variables. Note that only
married couples will have valid values for both education variables so the conclusions
apply only to that population. Are there education differences related to having a pistol or
to region?
2. For those with extra time: If there are differences, investigate them using post hoc tests,
if appropriate.
3. For those with more extra time: Limit the analysis to region but add father’s education
(paeduc) and mother’s education (maeduc) as additional dependent variables.

Chapter 10: Repeated Measures Analysis of Variance


1. The SPSS data file Pilot.sav contains data modeled after a human factors study in which
the accuracy with which pilots read different types of dials under different environmental
conditions was investigated. Pilots were assigned to one of two noise conditions (a
between-subject factor). Each pilot read three different types of dials (for example,
digital, thermometer, and arrow indicator) at three time points (after 10, 20 and 30
minutes of being exposed to the noise). The dependent variable is an accuracy score:
higher numbers mean greater accuracy. Perform an analysis of variance to determine
which factors influence accuracy. Check assumptions when possible. The nine variables
corresponding to the three dial readings at three time points have the following names:
p1d1, p1d2, p1d3, p2d1,…, p3d3, where p refers to the time period and d refers to the
dial.
