DiscriminantAnalysis CompleteProblems Summer2003

SW388R7
Strategy for Complete discriminant Analysis

Data Analysis &
Computers II
Slide 1
Assumptions of normality, linearity, and homogeneity of variance
Outliers
Multicollinearity
Validation
Sample problem
Steps in solving problems

Assumptions of normality, linearity, and homogeneity of variance
 The ability of discriminant analysis to extract discriminant functions
that are capable of producing accurate classifications is enhanced
when the assumptions of normality, linearity, and homogeneity of
variance are satisfied.
 We will use the script to test each variable for normality and to
test whether substituting the log, square root, or inverse
transformation induces normality in a variable that fails to satisfy
the criteria for normality.
 We can compare the accuracy rate of a model that uses transformed
variables to that of a model that does not, to evaluate whether the
improvement gained from the transformed variables is sufficient to
justify the interpretational burden of explaining the transformations.
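The screening described above can be sketched in a few lines of Python. This is only an illustration, not the course script: it assumes the normality criterion used later in these slides (skewness and kurtosis both between -1.0 and +1.0), it uses the standard bias-corrected sample formulas for skewness and excess kurtosis, and the names `looks_normal`, `normalizing_transforms`, and the data in `skewed_sample` are hypothetical. All values must be positive for the log and inverse transformations.

```python
import math

def skewness(xs):
    """Bias-corrected sample skewness."""
    n = len(xs)
    mean = sum(xs) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in xs) / (n - 1))
    return n / ((n - 1) * (n - 2)) * sum(((x - mean) / sd) ** 3 for x in xs)

def kurtosis(xs):
    """Bias-corrected sample excess kurtosis (0 for a normal distribution)."""
    n = len(xs)
    mean = sum(xs) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in xs) / (n - 1))
    z4 = sum(((x - mean) / sd) ** 4 for x in xs)
    return (n * (n + 1) / ((n - 1) * (n - 2) * (n - 3)) * z4
            - 3 * (n - 1) ** 2 / ((n - 2) * (n - 3)))

def looks_normal(xs, limit=1.0):
    """Assumed criterion: skewness and kurtosis both within +/- limit."""
    return abs(skewness(xs)) <= limit and abs(kurtosis(xs)) <= limit

# The three candidate transformations named in the slides.
TRANSFORMS = {
    "log": math.log10,
    "square root": math.sqrt,
    "inverse": lambda x: 1.0 / x,
}

def normalizing_transforms(xs):
    """Names of the transformations after which the variable passes the screen."""
    return [name for name, f in TRANSFORMS.items()
            if looks_normal([f(x) for x in xs])]

# Hypothetical, strongly right-skewed values (not taken from GSS2000R).
skewed_sample = [10, 100, 100, 1000, 1000, 1000, 10000, 10000, 100000]
print(looks_normal(skewed_sample), normalizing_transforms(skewed_sample))
```

A transformation that passes the screen is only a candidate; the slides still require comparing accuracy rates before adopting it.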
Assumption of linearity in discriminant analysis
 Since the dependent variable is non-metric in discriminant analysis,
there is not a linear relationship between the dependent variable and
an independent variable.
 In discriminant analysis, the assumption of linearity applies to the
relationships between pairs of independent variables. To identify
violations of linearity, each metric independent variable would have to
be tested against all others.
 Since non-linearity only reduces the power to detect relationships, the
general advice is to attend to it only when we know that a variable in
our analysis consistently demonstrates non-linear relationships with
other independent variables.
 We will not test for linearity in our problems.

Assumption of homogeneity of variance
 The assumption of homogeneity of variance is particularly important in
the classification stage of discriminant analysis.
 If one of the groups defined by the dependent variable has greater
dispersion than the others, cases will tend to be over-classified into it.
 Homogeneity of variance is tested with Box's M test, which tests the
null hypothesis that the group variance-covariance matrices are equal.
 If we fail to reject this null hypothesis and conclude that the
variances are equal, we use the SPSS default of a pooled
covariance matrix in classification.
 If we reject the null hypothesis and conclude that the variances are
heterogeneous, we substitute separate covariance matrices in the
classification, and evaluate whether or not our classification accuracy
is improved.
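For reference, the usual textbook form of the Box's M statistic is given below. The notation is an assumption introduced here, not taken from the slides: g groups, n_i cases in group i, N cases in total, S_i the covariance matrix of group i, and S_p the pooled within-groups covariance matrix.

```latex
S_p = \frac{1}{N - g} \sum_{i=1}^{g} (n_i - 1)\, S_i ,
\qquad
M = (N - g) \ln \lvert S_p \rvert \;-\; \sum_{i=1}^{g} (n_i - 1) \ln \lvert S_i \rvert
```

M equals zero when all group covariance matrices are identical and grows as they diverge; the significance reported in the output comes from an approximation applied to M.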
Detecting outliers in discriminant analysis - 1
 For multiple regression, we used z scores, studentized residuals, and
Mahalanobis distance as criteria for omitting a case from the analysis
as an outlier.
 Since the independent variables in a discriminant analysis are either
metric or dichotomous, Mahalanobis distance can be used to detect a
case that is an outlier for the combination of independent variables.
 Tabachnick suggests eliminating cases that are multivariate outliers
using Mahalanobis distance. In the output for discriminant analysis,
SPSS provides the Mahalanobis distance for the two groups that a case
is most likely to belong to. SPSS suggests that this statistic can be
used to detect outliers. However, this is not useful for problems with
more than two groups defined by the dependent variable, and
computing the Mahalanobis distance for each group does not reproduce
the values in the SPSS output.
 In discriminant analysis, the group membership for each case is
predicted. Cases for which there is an erroneous prediction could be
considered errors and some of these could probably be labeled
outliers.
Detecting outliers in discriminant analysis - 2
 The strategy that we will use for detecting outliers is to test each case
as a multivariate outlier, and to omit those cases where the
probability of the Mahalanobis distance is less than or equal to 0.001.
 The script for detecting outliers computes the Mahalanobis distance
and its probability for each case. We will ignore the z score and the
studentized residual.
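The outlier rule can be illustrated with a small sketch. It is deliberately restricted to two independent variables, because for two degrees of freedom the chi-square upper-tail probability has the exact closed form exp(-D²/2); a general version would need a chi-square distribution function from a statistics library. The function names and the `cases` data are hypothetical, and this is not the course script.

```python
import math

def mahalanobis_sq_2d(points):
    """Squared Mahalanobis distance of each (x, y) case from the centroid."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    # Sample variances and covariance (n - 1 in the denominator).
    sxx = sum((x - mx) ** 2 for x, _ in points) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in points) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in points) / (n - 1)
    det = sxx * syy - sxy ** 2
    distances = []
    for x, y in points:
        dx, dy = x - mx, y - my
        # Quadratic form d' S^-1 d, written out for the 2 x 2 inverse.
        distances.append((syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det)
    return distances

def chi2_upper_tail_df2(d2):
    """P(chi-square with 2 df >= d2); this closed form holds only for df = 2."""
    return math.exp(-d2 / 2.0)

def flag_outliers(points, alpha=0.001):
    """Indices of cases whose Mahalanobis probability is <= alpha."""
    return [i for i, d2 in enumerate(mahalanobis_sq_2d(points))
            if chi2_upper_tail_df2(d2) <= alpha]

# Hypothetical data: a 5 x 5 grid of ordinary cases plus one extreme case.
cases = [(i % 5, i // 5) for i in range(25)] + [(50, -50)]
print(flag_outliers(cases))
```

Only the extreme case at the end of the list is flagged; the grid cases all have probabilities far above 0.001.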
Multicollinearity
 Multicollinearity has the same effect in discriminant analysis
that it does in multiple regression, i.e., the importance of an
independent variable will be undervalued because it has a very
strong relationship to another independent variable or
combination of independent variables.
 As in multiple regression, multicollinearity in discriminant
analysis is identified by examining tolerance values.
 While tolerance is routinely included in the output for the
stepwise method for including variables, it is not included for
simultaneous entry of variables. If a tolerance problem occurs
in a simultaneous entry problem, SPSS will include a table titled
"Variables Failing Tolerance Test."
 We should not attempt to interpret an analysis with a
multicollinearity problem until we have resolved the problem
by removing or combining the problematic variables.
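Tolerance is one minus the squared multiple correlation of a predictor with the other predictors, so values near zero signal multicollinearity. A minimal sketch for the simplest case of two predictors follows, where the multiple correlation reduces to the ordinary Pearson correlation. The function names are hypothetical, and the tolerance SPSS reports in a discriminant analysis comes from its own internal calculations, which this sketch does not attempt to reproduce.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation between two equally long lists of numbers."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def tolerance_two_predictors(xs, ys):
    """Tolerance = 1 - R^2; with one other predictor, R^2 is simply r^2."""
    return 1.0 - pearson_r(xs, ys) ** 2

# A predictor that is an exact multiple of another has tolerance 0.
print(tolerance_two_predictors([1, 2, 3, 4, 5], [2, 4, 6, 8, 10]))
```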
Validation
 The primary criteria for a successful discriminant analysis are:
 the existence of sufficient statistically significant
discriminant functions to distinguish among the groups
defined by the dependent variable, and
 an accuracy rate that substantially improves on the accuracy
rate obtainable by chance alone.
 SPSS calculates a cross-validated accuracy rate for the analysis,
using a jackknife or leave-one-out strategy. It
computes the discriminant analysis once for each case in the
sample, leaving the case out of the calculations for the
discriminant model. The discriminant model is then used to
classify the case that was left out or held out. Thus the bias
toward an optimistically high accuracy rate is avoided.
 We will use this cross-validation in our problems rather than
doing a separate 75-25% cross-validation.
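The leave-one-out procedure itself is independent of the classifier, which the following sketch tries to make visible. To stay short and self-contained it swaps in a nearest-centroid classifier as a stand-in for the discriminant functions; that substitution, the function names, and the toy data are all assumptions made for illustration.

```python
def centroid(rows):
    """Mean of each column across a list of equally long rows."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def fit_nearest_centroid(X, y):
    """Stand-in model: one centroid per group label."""
    groups = {}
    for row, label in zip(X, y):
        groups.setdefault(label, []).append(row)
    return {label: centroid(rows) for label, rows in groups.items()}

def predict_nearest_centroid(model, row):
    """Assign the case to the group with the closest centroid."""
    def sq_dist(center):
        return sum((a - b) ** 2 for a, b in zip(row, center))
    return min(model, key=lambda label: sq_dist(model[label]))

def leave_one_out_accuracy(X, y):
    """Refit the model once per case, each time classifying the held-out case."""
    hits = 0
    for i in range(len(X)):
        model = fit_nearest_centroid(X[:i] + X[i + 1:], y[:i] + y[i + 1:])
        hits += predict_nearest_centroid(model, X[i]) == y[i]
    return hits / len(X)

# Toy data: two well-separated groups, plus one group-1 case that sits in group 2.
X = [[0, 0], [1, 0], [0, 1], [10, 10], [11, 10], [10, 11], [10, 10]]
y = [1, 1, 1, 2, 2, 2, 1]
print(leave_one_out_accuracy(X, y))
```

Because the held-out case never contributes to the model that classifies it, the misplaced seventh case is counted as an error rather than pulling a centroid toward itself.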
Overall strategy for solving problems
1. Run a baseline discriminant analysis using the method for including
variables implied by the problem statement to find the baseline
cross-validated accuracy rate for the model.
2. Test for useful transformations to improve normality.
3. Substitute transformed variables and check for outliers.
4. If the cross-validated accuracy rate from the discriminant analysis using
transformed variables and omitting outliers is at least 2% better than
the baseline cross-validated accuracy rate, select it for interpretation;
otherwise select the baseline model.
5. If the Box’s M statistic is statistically significant, we violate the
assumption of homogeneity of variance and re-run the analysis using
separate covariance matrices for classification. If the accuracy rate
increases by more than 2%, we interpret this model; otherwise we return
to the model using the pooled covariance matrix.
6. If the cross-validated accuracy rate is at least 25% higher than the
proportional by chance accuracy rate, interpret the selected
discriminant model:
 Number of functions and importance of predictors
 Role of individual variables on functions distinguishing among groups
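The accuracy criterion in step 6 can be worked through numerically. The proportional by chance accuracy rate is conventionally computed as the sum of the squared proportions of cases in each group; the sketch below applies it to the group sizes (64, 62, and 36) and the 50.0% cross-validated accuracy rate reported for Problem 1 in the classification results later in these slides. The function names are hypothetical.

```python
def proportional_by_chance_accuracy(group_sizes):
    """Sum of squared group proportions."""
    total = sum(group_sizes)
    return sum((n / total) ** 2 for n in group_sizes)

def meets_accuracy_criterion(cross_validated_accuracy, group_sizes, margin=0.25):
    """True when accuracy is at least (1 + margin) times the by-chance rate."""
    threshold = (1 + margin) * proportional_by_chance_accuracy(group_sizes)
    return cross_validated_accuracy >= threshold

# Group sizes for natfare in the Problem 1 classification table.
sizes = [64, 62, 36]
chance = proportional_by_chance_accuracy(sizes)
print(round(chance, 3), round(1.25 * chance, 3))
print(meets_accuracy_criterion(0.50, sizes))
```

For these group sizes the by-chance rate is about 35.2%, so the criterion is about 44.0%, which the 50.0% cross-validated rate exceeds.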
Problem 1
In the dataset GSS2000R, is the following statement true, false, or an incorrect application of a statistic? Use a
level of significance of 0.05 for the statistical analysis. Use a level of significance of 0.01 for evaluating missing
data and assumptions.
From the list of variables "number of hours worked in the past week" [hrs1], "self-employment" [wrkslf],
"highest year of school completed" [educ], and "income" [rincom98], the most useful predictors for
distinguishing among groups based on responses to "opinion about spending on welfare" [natfare] are "number
of hours worked in the past week" [hrs1], "self-employment" [wrkslf], and "highest year of school completed"
[educ]. These predictors differentiate survey respondents who thought we spend too little money on welfare
from survey respondents who thought we spend about the right amount of money on welfare who, in turn, are
differentiated from survey respondents who thought we spend too much money on welfare.
The most important predictor of groups based on responses to opinion about spending on welfare was number
of hours worked in the past week. The second most important predictor of groups based on responses to
opinion about spending on welfare was self-employment. The third most important predictor of groups based
on responses to opinion about spending on welfare was highest year of school completed.
Survey respondents who thought we spend about the right amount of money on welfare worked fewer hours in
the past week than survey respondents who thought we spend too little or too much money on welfare. Survey
respondents who thought we spend about the right amount of money on welfare had completed more years of
school than survey respondents who thought we spend too little or too much money on welfare. Survey
respondents who thought we spend too much money on welfare were more likely to be self-employed than
survey respondents who thought we spend too little money on welfare.
1. True
2. True with caution
3. False
4. Inappropriate application of a statistic
Dissecting problem 1 - 1
The problem may give us different levels of significance for the analysis.
In this problem, we are told to use 0.05 as alpha for the discriminant
analysis, but 0.01 for testing assumptions.
The variables listed first in the problem statement are the independent
variables (IVs): "number of hours worked in the past week" [hrs1],
"self-employment" [wrkslf], "highest year of school completed" [educ],
and "income" [rincom98].

The variable used to define groups is the dependent variable (DV):
"opinion about spending on welfare" [natfare].

When a problem asks us to identify the best or most useful predictors
from a list of independent variables, we do stepwise discriminant
analysis.
The problem identifies three groups for the dependent variable:
 survey respondents who thought we spend too much money on welfare
 survey respondents who thought we spend about the right amount of
money on welfare
 survey respondents who thought we spend too little money on welfare.
To distinguish among three groups, the analysis will be required to find
two statistically significant discriminant functions.
In a stepwise analysis, we only interpret the independent variables
that are entered in the stepwise analysis.

The importance of individual predictors is based on order of entry in
the analysis.
The specific relationships listed in the problem indicate how the
independent variable relates to groups of the dependent variable, e.g.,
the mean for hours worked in the past week will be lower for respondents
who think we spend the right amount of money versus respondents who
think we spend too much or too little.

For the statement in a stepwise analysis problem to be true, we must
have enough statistically significant functions to distinguish among
the groups, the order of entry must be correct, and each significant
relationship must be interpreted correctly.
LEVEL OF MEASUREMENT - 1
Discriminant analysis requires that the dependent variable be non-metric
and the independent variables be metric or dichotomous. "Opinion about
spending on welfare" [natfare] is an ordinal level variable, which
satisfies the level of measurement requirement.
It contains three categories: survey respondents who
thought we spend too much money on welfare,
survey respondents who thought we spend about the
right amount of money on welfare, and survey
respondents who thought we spend too little money
on welfare.
LEVEL OF MEASUREMENT - 2
"Number of hours worked in the past week" [hrs1] and "highest year of
school completed" [educ] are interval level variables, which satisfy
the level of measurement requirements for discriminant analysis.

"Income" [rincom98] is an ordinal level variable. If we follow the
convention of treating ordinal level variables as metric variables, the
level of measurement requirement for discriminant analysis is
satisfied. Since some data analysts do not agree with this convention,
a note of caution should be included in our interpretation.

"Self-employment" [wrkslf] is a dichotomous or dummy-coded nominal
variable which may be included in discriminant analysis.
PATTERNS OF MISSING DATA - 1
Run the script to check missing data. Move the variables included in
the analysis, mark the option for missing data, specify that the
dependent variable is nonmetric, and click the OK button.
Be sure to specify that the dependent variable is nonmetric.
Statistics
                                       N Valid   N Missing
WELFARE                                  253         17
NUMBER OF HOURS WORKED LAST WEEK         176         94
HIGHEST YEAR OF SCHOOL COMPLETED         269          1
RESPONDENTS INCOME                       168        102
R SELF-EMP OR WORKS FOR SOMEBODY         250         20
Several variables were missing data for more than 5% of the

cases in the data set: "opinion about spending on welfare"
[natfare] was missing data for 6.3% of the cases (17 of 270
cases); "number of hours worked in the past week" [hrs1] was
missing data for 34.8% of the cases (94 of 270 cases); "self-
employment" [wrkslf] was missing data for 7.4% of the cases
(20 of 270 cases); and "income" [rincom98] was missing data
for 37.8% of the cases in the data set (102 of 270 cases).
Missing/valid dichotomous variables were created for these
variables to test whether the group of cases with missing data
differed significantly from the group of cases with valid data.
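The percentages quoted above follow directly from the Missing row of the Statistics table. A small sketch of the arithmetic, using the 5% screening threshold from the text; the function and dictionary names are hypothetical:

```python
def percent_missing(n_missing, n_total):
    """Share of cases with missing data, as a percentage."""
    return 100.0 * n_missing / n_total

def variables_over_threshold(missing_counts, n_total, threshold=5.0):
    """Variables missing data for more than `threshold` percent of cases."""
    return {name: round(percent_missing(k, n_total), 1)
            for name, k in missing_counts.items()
            if percent_missing(k, n_total) > threshold}

# Missing counts from the Statistics table; 270 cases in the data set.
missing_counts = {"natfare": 17, "hrs1": 94, "educ": 1, "rincom98": 102, "wrkslf": 20}
print(variables_over_threshold(missing_counts, 270))
```

Only [educ], with 1 missing case out of 270, falls below the 5% threshold.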
There were significant differences in the

statistical tests comparing cases with
missing data to cases with valid data.
Cases who had missing data for the variable "number of hours
worked in the past week" [hrs1] had an average score on the
variable "highest year of school completed" [educ] that was 1.87
units lower than the average for cases who had valid data (t=-5.194,
p<0.001) and had an average score on the variable "income"
[rincom98] that was 5.32 units lower than the average for cases who
had valid data (t=-4.758, p<0.001).
Since there were significant differences in the

statistical tests comparing cases with missing data
to cases with valid data, a caution was added to
the interpretation of any findings, pending further
analysis of the missing data pattern.
Cases who had missing data for the variable

"income" [rincom98] had an average score on the
variable "highest year of school completed" [educ]
that was 2.27 units lower than the average for
cases who had valid data (t=-6.287, p<0.001).
The baseline discriminant analysis
We begin our analysis by

running a stepwise
discriminant analysis with
natfare as the dependent
variable and hrs1, wrkslf,
educ, and rincom98 as the
independent variables.
Select the Classify |

Discriminant… command
from the Analyze menu.
Selecting the dependent variable
First, highlight the

dependent variable
natfare in the list
of variables.
Second, click on the right

arrow button to move the
dependent variable to the
Grouping Variable text box.
Defining the group values
When SPSS moves the dependent variable to the Grouping Variable text
box, it puts two question marks in parentheses after the variable name.
This is a reminder that we have to enter the numbers that represent the
groups we want to include in the analysis.
To specify the group numbers, click on the Define Range… button.
Completing the range of group values
The value labels for natfare show three categories:
1 = TOO LITTLE
2 = ABOUT RIGHT
3 = TOO MUCH
The range of values that we need to enter goes from 1 as the minimum
to 3 as the maximum.
First, type in 1 in the Minimum text box.
Second, type in 3 in the Maximum text box.
Third, click on the Continue button to close the dialog box.
Note: if we enter the wrong range of group numbers, e.g., 1 to 2
instead of 1 to 3, SPSS will only include groups 1 and 2 in the analysis.
Specifying the method for including variables
SPSS provides us with two methods for including

variables: to enter all of the independent variables
at one time, and a stepwise method for selecting
variables using a statistical test to determine the
order in which variables are included.
Since the problem states the

importance of the best subset of
predictors, we mark the option
button to Use stepwise method.
Requesting statistics for the output
Click on the Statistics…

button to select statistics
we will need for the
analysis.
Specifying statistical output
First, mark the Means checkbox on the Descriptives panel. We will use
the group means in our interpretation.
Second, mark the Univariate ANOVAs checkbox on the Descriptives panel.
Perusing these tests suggests which variables might be useful
discriminators.
Third, mark the Box’s M checkbox. Box’s M statistic evaluates
conformity to the assumption of homogeneity of group variances.
Fourth, click on the Continue button to close the dialog box.
Specifying details for the stepwise method
Click on the Method…

button to specify the
specific statistical criteria to
use for including variables.
Details for the stepwise method
First, mark the Mahalanobis distance option button on the Method panel.
Second, mark the Summary of steps checkbox to produce a summary table
when a new variable is added.
Third, click on the option button Use probability of F so that we can
incorporate the level of significance specified in the problem.
Fourth, type the level of significance in the Entry text box. The
Removal value is twice as large as the entry value.
Fifth, click on the Continue button to close the dialog box.
Specifying details for classification
Click on the Classify…

button to specify details for
the classification phase of
the analysis.
Details for classification - 1
First, mark the option button to Compute from

group sizes on the Prior Probabilities panel.
This incorporates the size of the groups defined
by the dependent variable into the classification
of cases using the discriminant functions.
Second, mark the

Casewise results
checkbox on the
Display panel to
include
classification details
for each case in the
output.
Third, mark the Summary

table checkbox to include
summary tables
comparing actual and
predicted classification.
Fourth, mark the Leave-one-out

classification checkbox to request SPSS to
include a cross-validated classification in
the output. This option produces a less
biased estimate of classification accuracy
by sequentially holding each case out of
the calculations for the discriminant
functions, and using the derived functions
to classify the case held out.
Fifth, accept the default of the Within-groups option button on the Use
Covariance Matrix panel. The covariance matrices are the measure of
the dispersion in the groups defined by the dependent variable. If we
fail the homogeneity of group variances test (Box’s M), our option is
to use Separate-groups covariance in classification.
Sixth, mark the Combined-groups checkbox on the Plots panel to obtain
a visual plot of the relationship between functions and groups defined
by the dependent variable.
Seventh, click on the Continue button to close the dialog box.
Completing the discriminant analysis request
Click on the OK
button to request the
output for the
discriminant
analysis.
Classification accuracy before transformations or removing outliers
Prior to any transformations of variables to satisfy the assumptions of
discriminant analysis or removal of outliers, the cross-validated
accuracy rate was 50.0%.
This accuracy rate is the benchmark that we will use to evaluate the
utility of transformations and the elimination of outliers.

Classification Results (b, c)
                                            Predicted Group Membership
                          WELFARE              1       2       3    Total
Original            Count 1                   43      15       6      64
                          2                   26      30       6      62
                          3                   17      10       9      36
                          Ungrouped cases      3       3       2       8
                    %     1                 67.2    23.4     9.4   100.0
                          2                 41.9    48.4     9.7   100.0
                          3                 47.2    27.8    25.0   100.0
                          Ungrouped cases   37.5    37.5    25.0   100.0
Cross-validated (a) Count 1                   43      15       6      64
                          2                   26      30       6      62
                          3                   17      11       8      36
                    %     1                 67.2    23.4     9.4   100.0
                          2                 41.9    48.4     9.7   100.0
                          3                 47.2    30.6    22.2   100.0
a. Cross validation is done only for those cases in the analysis. In cross validation, each case
is classified by the functions derived from all cases other than that case.
b. 50.6% of original grouped cases correctly classified.
c. 50.0% of cross-validated grouped cases correctly classified.
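The two accuracy rates in footnotes b and c can be recovered from the count rows of the table: the correctly classified cases lie on the diagonal, where the actual and predicted groups match. A brief sketch using the counts above (ungrouped cases excluded); the function and variable names are hypothetical:

```python
def accuracy_rate(confusion):
    """Correctly classified cases (the diagonal) divided by all classified cases."""
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    total = sum(sum(row) for row in confusion)
    return correct / total

# Rows are actual groups 1-3, columns are predicted groups 1-3.
original = [[43, 15, 6], [26, 30, 6], [17, 10, 9]]
cross_validated = [[43, 15, 6], [26, 30, 6], [17, 11, 8]]

print(round(100 * accuracy_rate(original), 1))
print(round(100 * accuracy_rate(cross_validated), 1))
```

The single case that moves off the diagonal in group 3 under cross-validation accounts for the drop from 82 to 81 correct classifications out of 162.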
ASSUMPTION OF NORMALITY
First, move the variables to the list boxes based on the role that
the variable plays in the analysis and its level of measurement.
Second, click on the Normality option button to request that SPSS
produce the output needed to evaluate the assumption of normality.
Third, mark the checkboxes for the transformations that we want to
test in evaluating the assumption.
Fourth, mark the dependent variable as nonmetric.
Fifth, click on the OK button to produce the output.
Normality of independent variable: highest year of school completed
Descriptives: HIGHEST YEAR OF SCHOOL COMPLETED
                                                 Statistic   Std. Error
Mean                                               13.12        .179
95% Confidence Interval for Mean   Lower Bound     12.77
                                   Upper Bound     13.47
5% Trimmed Mean                                    13.14
Median                                             13.00
Variance                                            8.583
Std. Deviation                                      2.930
Minimum                                             2
Maximum                                            20
Range                                              18
Interquartile Range                                 3.00
Skewness                                            -.137       .149
Kurtosis                                            1.246       .296
The independent variable "highest year of

school completed" [educ] does not satisfy the
criteria for a normal distribution.
The skewness (-0.137) fell between -1.0 and

+1.0, but the kurtosis (1.246) fell outside the
range from -1.0 to +1.0.
highest year of school completed
Neither the logarithmic, the square root,

nor the inverse transformation normalizes
the variable.
A caution was added to the findings.

number of hours worked in the past week
Descriptives: NUMBER OF HOURS WORKED LAST WEEK
                                                 Statistic   Std. Error
Mean                                               40.99        .958
95% Confidence Interval for Mean   Lower Bound     39.10
                                   Upper Bound     42.88
5% Trimmed Mean                                    41.21
Median                                             40.00
Variance                                          161.491
Std. Deviation                                     12.708
Minimum                                             4
Maximum                                            80
Range                                              76
Skewness                                            -.324       .183
Kurtosis                                             .935       .364
The variable "number of hours worked in the

past week" [hrs1] satisfies the criteria for a
normal distribution. The skewness (-0.324)
and kurtosis (0.935) were both between -1.0
and +1.0.
SW388R7
Data Analysis &
Computers II
Slide 41
income
Descriptives

RESPONDENTS INCOME                    Statistic   Std. Error
Mean                                  13.35       .419
95% Confidence Interval for Mean
  Lower Bound                         12.52
  Upper Bound                         14.18
5% Trimmed Mean                       13.54
Median                                15.00
Variance                              29.535
Std. Deviation                        5.435
Minimum                               1
Maximum                               23
Range                                 22
Skewness                              -.686       .187
Kurtosis                              -.253       .373

The variable "income" [rincom98] satisfies the criteria for a
normal distribution. The skewness (-0.686) and kurtosis (-0.253)
were both between -1.0 and +1.0.
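
The normality rule applied on these slides can be written as a short check. The sketch below is illustrative only: the function name is hypothetical, and treating the -1.0 and +1.0 limits as inclusive is an assumption, since the slides only say "between".

```python
def satisfies_normality(skewness, kurtosis, limit=1.0):
    """Rule of thumb from the slides: both statistics lie within +/- limit.

    Inclusive bounds are an assumption; the slides only say "between".
    """
    return -limit <= skewness <= limit and -limit <= kurtosis <= limit


# Skewness and kurtosis values copied from the Descriptives tables.
variables = {
    "educ": (-0.137, 1.246),
    "hrs1": (-0.324, 0.935),
    "rincom98": (-0.686, -0.253),
}

results = {name: satisfies_normality(s, k) for name, (s, k) in variables.items()}
for name, ok in results.items():
    print(name, "satisfies" if ok else "does not satisfy", "the normality criteria")
```

Only educ fails the check, because its kurtosis exceeds +1.0, which matches the conclusions stated on the slides.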
SW388R7
Using the script to detect outliers

Data Analysis &
Computers II
Slide 42
Move the variables to the list boxes for the dependent and
independent variables, including transformed variables that we
have decided to use.

Click on the Detect outliers option button to request that SPSS
create the variables needed to detect outliers.

Note: when detecting outliers, clear the check box for deleting
variables to keep SPSS from deleting the variables immediately
after it creates them.

Click on the OK button to produce the output.
SW388R7
Outliers in the data set

Data Analysis &
Computers II
Slide 43
A case can be characterized as an outlier if the

probability associated with the Mahalanobis D²
for the combination of independent variables is
less than 0.001.
The smallest probability for any case in the

data set is larger than 0.001. There were no
cases that could be classified as outliers.
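
As a rough illustration of the outlier rule, the sketch below converts a Mahalanobis D² value into a probability and applies the 0.001 cutoff. It assumes the probability is the upper tail of a chi-square distribution with degrees of freedom equal to the number of independent variables, which is the usual convention but is not stated on the slide. The helper names and the two D² values are hypothetical.

```python
import math


def chi2_sf(x, df):
    """Upper-tail chi-square probability, using the power series for the
    regularized lower incomplete gamma function (standard library only)."""
    if x <= 0:
        return 1.0
    a, z = df / 2.0, x / 2.0
    term = 1.0 / a
    total = term
    n = 1
    while abs(term) > 1e-16 * abs(total):
        term *= z / (a + n)
        total += term
        n += 1
    lower = total * math.exp(-z + a * math.log(z) - math.lgamma(a))
    return max(0.0, 1.0 - lower)


def is_outlier(d_squared, n_predictors, cutoff=0.001):
    """Flag a case whose D-squared probability falls below the cutoff.

    Assumption: df equals the number of independent variables.
    """
    return chi2_sf(d_squared, n_predictors) < cutoff


# Hypothetical D-squared values for two cases with 4 independent variables.
for d2 in (10.0, 20.0):
    print(d2, round(chi2_sf(d2, 4), 4), is_outlier(d2, 4))
```

In SPSS the script saves these probabilities as new variables; the sketch only shows the arithmetic behind the cutoff.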
SW388R7
Omitting outliers
Data Analysis &
Computers II
Slide 44
In this analysis, there were

no outliers. Had there been
outliers, we would have used
a select if command with this
formula to exclude them
from the analysis.
Classification accuracy using transformations and
SW388R7
Data Analysis &
Computers II
Slide 45
excluding outliers

                           WELFARE           1       2       3       Total
Original          Count    1                 43      15      6       64
                           2                 26      30      6       62
                           3                 17      10      9       36
                           Ungrouped cases   3       3       2       8
                  %        1                 67.2    23.4    9.4     100.0
                           2                 41.9    48.4    9.7     100.0
                           3                 47.2    27.8    25.0    100.0
                           Ungrouped cases   37.5    37.5    25.0    100.0
Cross-validated a Count    1                 43      15      6       64
                           2                 26      30      6       62
                           3                 17      11      8       36
                  %        1                 67.2    23.4    9.4     100.0
                           2                 41.9    48.4    9.7     100.0
                           3                 47.2    30.6    22.2    100.0
a. Cross validation is done only for those cases in the analysis. In cross validation, each case
is classified by the functions derived from all cases other than that case.
b. 50.6% of original grouped cases correctly classified.

Had we used a transformation for normality or identified outliers,
we would have substituted the transformed variable in the list of
independent variables, selected out cases that were outliers, and
run the revised discriminant analysis model.

If the cross-validated accuracy rate for the revised model were
more accurate by 2% or more, we would select the revised model for
interpretation.

Since no useful transformations of variables were identified in the
evaluation of normality for discriminant analysis and no outliers
were identified as candidates for removal from the analysis, the
baseline discriminant analysis with all cases and the original
form of all variables will be interpreted.
SW388R7
SAMPLE SIZE - 1
Data Analysis &
Computers II
Slide 46
Analysis Case Processing Summary

Unweighted Cases                                          N      Percent
Valid                                                     138    51.1
Excluded  Missing or out-of-range group codes             7      2.6
          At least one missing discriminating variable    115    42.6
          Both missing or out-of-range group codes
          and at least one missing discriminating
          variable                                        10     3.7
          Total                                           132    48.9
Total                                                     270    100.0

The minimum ratio of valid cases to independent variables for
discriminant analysis is 5 to 1, with a preferred ratio of 20 to 1.
In this analysis, there are 138 valid cases and 4 independent
variables.

The ratio of cases to independent variables is 34.5 to 1, which
satisfies the minimum requirement. In addition, the ratio of
34.5 to 1 satisfies the preferred ratio of 20 to 1.
SW388R7
SAMPLE SIZE - 2
Data Analysis &
Computers II
Slide 47
Prior Probabilities for Groups

                     Cases Used in Analysis
WELFARE    Prior     Unweighted    Weighted
1          .406      56            56.000
2          .362      50            50.000
3          .232      32            32.000
Total      1.000     138           138.000

In addition to the requirement for the ratio of cases to
independent variables, discriminant analysis requires that there
be a minimum number of cases in the smallest group defined by the
dependent variable. The number of cases in the smallest group must
be larger than the number of independent variables, and preferably
contain 20 or more cases.

The number of cases in the smallest group in this problem is 32,
which is larger than the number of independent variables (4),
satisfying the minimum requirement. In addition, the number of
cases in the smallest group satisfies the preferred minimum of 20
cases.
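
Both sample size requirements reduce to simple arithmetic. The following sketch, with a hypothetical function name, applies them to the counts reported for this problem (138 valid cases, groups of 56, 50, and 32, and 4 independent variables).

```python
def sample_size_checks(valid_cases, group_sizes, n_predictors):
    """Apply the minimum and preferred sample size rules from the slides."""
    ratio = valid_cases / n_predictors
    smallest = min(group_sizes)
    return {
        "ratio": ratio,
        "meets_minimum_ratio": ratio >= 5,        # at least 5 to 1
        "meets_preferred_ratio": ratio >= 20,     # preferably 20 to 1
        "smallest_group": smallest,
        "meets_minimum_group": smallest > n_predictors,
        "meets_preferred_group": smallest >= 20,  # preferably 20 or more cases
    }


checks = sample_size_checks(valid_cases=138, group_sizes=[56, 50, 32], n_predictors=4)
for name, value in checks.items():
    print(name, value)
```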
ASSUMPTION OF EQUAL DISPERSION FOR DEPENDENT
SW388R7
Data Analysis &
Computers II
Slide 48
VARIABLE GROUPS
The assumption of equal

dispersion for groups defined by
the dependent variable only
affects the classification phase of
discriminant analysis, and so is
not evaluated until we are
determining the final accuracy
rate of the model.
Box's M test evaluated the

homogeneity of dispersion
matrices across the subgroups of
the dependent variable. The null
hypothesis is that the dispersion
matrices are homogeneous. If
the analysis fails this test, we
request the use of separate
group dispersion matrices in the
classification phase of the
discriminant analysis to see if
this improves our accuracy rate.
SW388R7
Data Analysis &
Computers II
Slide 49
VARIABLE GROUPS
In this analysis, Box's M statistic

had a value of 19.386 with a
probability of 0.096. Since the
probability for Box's M is greater
than the level of significance for
testing assumptions (0.01), the
null hypothesis is not rejected
and the assumption of equal
dispersion is satisfied.
We use the pooled or within-

groups covariance matrix for
classification.
SW388R7
Data Analysis &
Computers II
Slide 50
VARIABLE GROUPS
Had we rejected the null hypothesis and concluded that

dispersion was not equal across groups, we would have run
the analysis again, specifying separate-groups covariance
matrices for classification.
If classification using separate covariance matrices were

more accurate by 2% or more, we would report classification
accuracy based on this model rather than the one that uses
within-groups covariance.
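
The decision described on these slides can be summarized in a few lines. This is a sketch under stated assumptions: the function name is hypothetical, accuracy rates are given in percent, and the "2% or more" rule is read as a gain of at least 2 percentage points. The failing Box's M probability and the separate-groups accuracy rates in the second and third calls are made-up illustrations.

```python
def choose_covariance(box_m_p, alpha, pooled_accuracy,
                      separate_accuracy=None, min_gain=2.0):
    """Pick the covariance matrix used for classification.

    Assumption: min_gain is measured in percentage points.
    """
    if box_m_p > alpha:
        # Null hypothesis of equal dispersion is not rejected.
        return "within-groups"
    if separate_accuracy is not None and separate_accuracy - pooled_accuracy >= min_gain:
        return "separate-groups"
    return "within-groups"


# Values from this problem: Box's M probability 0.096, alpha 0.01.
print(choose_covariance(0.096, 0.01, pooled_accuracy=50.0))
# Hypothetical case where Box's M fails and separate matrices help.
print(choose_covariance(0.004, 0.01, pooled_accuracy=50.0, separate_accuracy=53.0))
# Hypothetical case where Box's M fails but the gain is too small.
print(choose_covariance(0.004, 0.01, pooled_accuracy=50.0, separate_accuracy=51.0))
```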
SW388R7
NUMBER OF DISCRIMINANT FUNCTIONS - 1

Data Analysis &
Computers II
Slide 51
The maximum possible number of discriminant

functions is the smaller of one less than the
number of groups defined by the dependent
variable and the number of independent
variables.
In this analysis there were 3 groups defined by

opinion about spending on welfare and 4
independent variables, so the maximum
possible number of discriminant functions was
2.
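
The rule for the maximum number of functions is a one-line computation; the function name below is hypothetical.

```python
def max_discriminant_functions(n_groups, n_predictors):
    """Smaller of (number of groups - 1) and the number of independent variables."""
    return min(n_groups - 1, n_predictors)


# 3 welfare opinion groups and 4 independent variables.
print(max_discriminant_functions(3, 4))  # prints 2
```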
SW388R7
NUMBER OF DISCRIMINANT FUNCTIONS - 2

Data Analysis &
Computers II
Slide 52
In the table of Wilks' Lambda which tested functions for

statistical significance, the stepwise analysis identified 2
discriminant functions that were statistically significant. The
Wilks' lambda statistic for the test of function 1 through 2
functions (chi-square=21.853) had a probability of 0.001 which
was less than or equal to the level of significance of 0.05.
After removing function 1, the Wilks' lambda statistic for the

test of function 2 (chi-square=7.074) had a probability of
0.029 which was less than or equal to the level of
significance of 0.05. The significance of the maximum
possible number of discriminant functions supports the
interpretation of a solution using 2 discriminant functions.
SW388R7
MULTICOLLINEARITY
Data Analysis &
Computers II
Slide 53
Multicollinearity occurs when one

independent variable is so
strongly correlated with one or
more other variables that its
relationship to the dependent
variable is likely to be
misinterpreted. Its potential
unique contribution to explaining
the dependent variable is
minimized by its strong
relationship to other independent
variables. Multicollinearity is
indicated when the tolerance
value for an independent variable
is less than 0.10.
The tolerance values for all of the

independent variables are larger
than 0.10. Multicollinearity is not
a problem in this discriminant
analysis.
Independent variables and group membership:
SW388R7
Data Analysis &
Computers II
Slide 54
relationship of functions to groups
In order to specify the role that each independent

variable plays in predicting group membership on the
dependent variable, we must link together the
relationship between the discriminant functions and the
groups defined by the dependent variable, the role of
the significant independent variables in the
discriminant functions, and the differences in group
means for each of the variables.
Functions at Group Centroids

           Function
WELFARE    1        2
1          -.220    .235
2          .446     -.031
3          -.311    -.362
Unstandardized canonical discriminant functions evaluated at group means

Function 1 separates survey respondents who thought we spend about
the right amount of money on welfare (the positive value of 0.446)
from survey respondents who thought we spend too much (negative
value of -0.311) or too little money (negative value of -0.220) on
welfare.

Function 2 separates survey respondents who thought we spend too
little money on welfare (positive value of 0.235) from survey
respondents who thought we spend too much money (negative value of
-0.362) on welfare. We ignore the second group (-0.031) in this
comparison because it was distinguished from the other two groups
by function 1.
SW388R7
Data Analysis &
Computers II
Slide 55
which predictors to interpret
Variables Entered/Removed (a,b,c,d)

                                              Min. D Squared
                                                         Between   Exact F
Step   Entered                                Statistic  Groups    Statistic   df1   df2       Sig.
1      NUMBER OF HOURS WORKED LAST WEEK       .023       1 and 3   .475        1     135.000   .492
2      R SELF-EMP OR WORKS FOR SOMEBODY       .251       1 and 2   3.289       2     134.000   .040
3      HIGHEST YEAR OF SCHOOL COMPLETED       .364       1 and 3   2.433       3     133.000   .068
At each step, the variable that maximizes the Mahalanobis distance between the two closest
groups is entered.

When we use the stepwise method of variable inclusion, we limit our
interpretation of independent variable predictors to those listed
as statistically significant in the table of Variables
Entered/Removed.

We will interpret the impact on membership in groups defined by the
dependent variable by the independent variables:
• number of hours worked in the past week
• self-employment
• highest year of school completed

Had we used simultaneous entry of all variables, we would not have
imposed this limitation.

a. Maximum number of steps is 8.
b. Maximum significance of F to enter is .05.
c.
SW388R7
Data Analysis &
Computers II
Slide 56
predictor loadings on functions
Structure Matrix

                                      Function
                                      1        2
HIGHEST YEAR OF SCHOOL COMPLETED      .687*    .136
NUMBER OF HOURS WORKED LAST WEEK      -.582*   .345
R SELF-EMP OR WORKS FOR SOMEBODY      .223     .889*
RESPONDENTS INCOME (a)                .101     .292*
Pooled within-groups correlations between discriminating
variables and standardized canonical discriminant functions
Variables ordered by absolute size of correlation within function.
*. Largest absolute correlation between each variable and any discriminant function
a. This variable not used in the analysis.

Based on the structure matrix, the predictor variables strongly
associated with discriminant function 1 which distinguished
between survey respondents who thought we spend about the right
amount of money on welfare and survey respondents who thought we
spend too much or little money on welfare were number of hours
worked in the past week (r=-0.582) and highest year of school
completed (r=0.687).

Based on the structure matrix, the predictor variable strongly
associated with discriminant function 2 which distinguished
between survey respondents who thought we spend too little money
on welfare and survey respondents who thought we spend too much
money on welfare was self-employment (r=0.889).
SW388R7
Data Analysis &
Computers II
Slide 57
predictors associated with first function - 1
Group Statistics

                                                                           Valid N (listwise)
WELFARE          Variable                           Mean    Std. Deviation  Unweighted  Weighted
1 TOO LITTLE     NUMBER OF HOURS WORKED LAST WEEK   43.96   13.240          56          56.000
                 HIGHEST YEAR OF SCHOOL COMPLETED   13.73   2.401           56          56.000
                 R SELF-EMP OR WORKS FOR SOMEBODY   1.93    .260            56          56.000
                 RESPONDENTS INCOME                 13.70   5.034           56          56.000
2 ABOUT RIGHT    NUMBER OF HOURS WORKED LAST WEEK   37.90   13.235          50          50.000
                 HIGHEST YEAR OF SCHOOL COMPLETED   14.78   2.558           50          50.000
                 R SELF-EMP OR WORKS FOR SOMEBODY   1.90    .303            50          50.000
                 RESPONDENTS INCOME                 14.00   5.503           50          50.000
3 TOO MUCH       NUMBER OF HOURS WORKED LAST WEEK   42.03   10.456          32          32.000
                 HIGHEST YEAR OF SCHOOL COMPLETED   13.38   2.524           32          32.000
                 R SELF-EMP OR WORKS FOR SOMEBODY   1.75    .440            32          32.000
                 RESPONDENTS INCOME                 14.75   5.304           32          32.000
Total            NUMBER OF HOURS WORKED LAST WEEK   41.32   12.846          138         138.000

The average number of hours worked in the past week for survey
respondents who thought we spend about the right amount of money on
welfare (mean=37.90) was lower than the average number of hours
worked in the past week for survey respondents who thought we spend
too little money on welfare (mean=43.96) and survey respondents who
thought we spend too much money on welfare (mean=42.03).

This supports the relationship that "survey respondents who thought
we spend about the right amount of money on welfare worked fewer
hours in the past week than survey respondents who thought we spend
too little or much money on welfare."
SW388R7
Data Analysis &
Computers II
Slide 58
predictors associated with first function - 2
Group Statistics

                                                                           Valid N (listwise)
WELFARE          Variable                           Mean    Std. Deviation  Unweighted  Weighted
1 TOO LITTLE     NUMBER OF HOURS WORKED LAST WEEK   43.96   13.240          56          56.000
                 HIGHEST YEAR OF SCHOOL COMPLETED   13.73   2.401           56          56.000
                 R SELF-EMP OR WORKS FOR SOMEBODY   1.93    .260            56          56.000
                 RESPONDENTS INCOME                 13.70   5.034           56          56.000
2 ABOUT RIGHT    NUMBER OF HOURS WORKED LAST WEEK   37.90   13.235          50          50.000
                 HIGHEST YEAR OF SCHOOL COMPLETED   14.78   2.558           50          50.000
                 R SELF-EMP OR WORKS FOR SOMEBODY   1.90    .303            50          50.000
                 RESPONDENTS INCOME                 14.00   5.503           50          50.000
3 TOO MUCH       NUMBER OF HOURS WORKED LAST WEEK   42.03   10.456          32          32.000
                 HIGHEST YEAR OF SCHOOL COMPLETED   13.38   2.524           32          32.000
                 R SELF-EMP OR WORKS FOR SOMEBODY   1.75    .440            32          32.000
Total            NUMBER OF HOURS WORKED LAST WEEK   41.32   12.846          138         138.000

The average highest year of school completed for survey respondents
who thought we spend about the right amount of money on welfare
(mean=14.78) was higher than the average highest year of school
completed for survey respondents who thought we spend too little
money on welfare (mean=13.73) and survey respondents who thought we
spend too much money on welfare (mean=13.38).

This supports the relationship that "survey respondents who thought
we spend about the right amount of money on welfare had completed
more years of school than survey respondents who thought we spend
too little or much money on welfare."
SW388R7
Data Analysis &
Computers II
Slide 59
predictors associated with second function
Group Statistics

                                                                           Valid N (listwise)
WELFARE          Variable                           Mean    Std. Deviation  Unweighted  Weighted
1 TOO LITTLE     NUMBER OF HOURS WORKED LAST WEEK   43.96   13.240          56          56.000
                 HIGHEST YEAR OF SCHOOL COMPLETED   13.73   2.401           56          56.000
                 R SELF-EMP OR WORKS FOR SOMEBODY   1.93    .260            56          56.000
                 RESPONDENTS INCOME                 13.70   5.034           56          56.000
2 ABOUT RIGHT    NUMBER OF HOURS WORKED LAST WEEK   37.90   13.235          50          50.000
                 HIGHEST YEAR OF SCHOOL COMPLETED   14.78   2.558           50          50.000
                 R SELF-EMP OR WORKS FOR SOMEBODY   1.90    .303            50          50.000
                 RESPONDENTS INCOME                 14.00   5.503           50          50.000
3 TOO MUCH       NUMBER OF HOURS WORKED LAST WEEK   42.03   10.456          32          32.000
                 HIGHEST YEAR OF SCHOOL COMPLETED   13.38   2.524           32          32.000
                 R SELF-EMP OR WORKS FOR SOMEBODY   1.75    .440            32          32.000
Total            NUMBER OF HOURS WORKED LAST WEEK   41.32   12.846          138         138.000

Since self-employment is a dichotomous variable, the mean is not
directly interpretable. Its interpretation must take into account
the coding by which 1 corresponds to self-employed and 2
corresponds to working for someone else. The lower mean for survey
respondents who thought we spend too much money on welfare
(mean=1.75), when compared to the mean for survey respondents who
thought we spend too little money on welfare (mean=1.93), implies
that the group contained more survey respondents who were
self-employed and fewer survey respondents who were working for
someone else.

This supports the relationship that "survey respondents who thought
we spend too much money on welfare were more likely to be
self-employed than survey respondents who thought we spend too
little money on welfare."
CLASSIFICATION USING THE DISCRIMINANT MODEL:
SW388R7
Data Analysis &
Computers II
Slide 60
by chance accuracy rate
The independent variables could be characterized as useful

predictors of membership in the groups defined by the dependent
variable if the cross-validated classification accuracy rate was
significantly higher than the accuracy attainable by chance alone.
Operationally, the cross-validated classification accuracy rate should
be at least 25% higher than the proportional by chance accuracy
rate.
The proportional by chance accuracy rate was computed by
squaring and summing the proportion of cases in each group from
the table of prior probabilities for groups (0.406² + 0.362² + 0.232²
= 0.350).
Prior Probabilities for Groups
Cases Used in Analysis

WELFARE Prior Unweighted Weighted
1 TOO LITTLE .406 56 56.000
2 ABOUT RIGHT .362 50 50.000
3 TOO MUCH .232 32 32.000
Total 1.000 138 138.000
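
The by chance computation can be reproduced directly from the prior probabilities. The sketch below uses hypothetical variable names and applies the 25% margin as a multiplier of 1.25.

```python
# Prior probabilities for the three WELFARE groups, from the table above.
priors = [0.406, 0.362, 0.232]

# Proportional by chance accuracy: sum of squared group proportions.
by_chance = sum(p ** 2 for p in priors)

# The criterion requires accuracy at least 25% higher than chance.
criterion = 1.25 * by_chance

print(f"by chance accuracy: {by_chance:.3f}")
print(f"criterion: {criterion * 100:.1f}%")
```

The cross-validated accuracy rate reported by SPSS is then compared against this criterion.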
SW388R7
Data Analysis &
Computers II
Slide 61
criteria for classification accuracy

                                              1 TOO     2 ABOUT   3 TOO
                           WELFARE            LITTLE    RIGHT     MUCH      Total
Original          Count    1 TOO LITTLE       43        15        6         64
                           2 ABOUT RIGHT      26        30        6         62
                           3 TOO MUCH         17        10        9         36
                           Ungrouped cases    3         3         2         8
                  %        1 TOO LITTLE       67.2      23.4      9.4       100.0
                           2 ABOUT RIGHT      41.9      48.4      9.7       100.0
                           3 TOO MUCH         47.2      27.8      25.0      100.0
                           Ungrouped cases    37.5      37.5      25.0      100.0
Cross-validated a Count    1 TOO LITTLE       43        15        6         64
                           2 ABOUT RIGHT      26        30        6         62
                           3 TOO MUCH         17        11        8         36
                  %        1 TOO LITTLE       67.2      23.4      9.4       100.0
                           2 ABOUT RIGHT      41.9      48.4      9.7       100.0
                           3 TOO MUCH         47.2      30.6      22.2      100.0
a. Cross validation is done only for those cases in the analysis. In cross validation, each case is
classified by the functions derived from all cases other than that case.

The cross-validated accuracy rate computed by SPSS was 50.0% which
was greater than or equal to the proportional by chance accuracy
criteria of 43.7% (1.25 x 35.0% = 43.7%). The criteria for
classification accuracy is satisfied.
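
The accuracy rates SPSS reports can be recovered from the counts in the classification table. The sketch below assumes rows are actual groups and columns are predicted groups, leaves out the ungrouped cases, and takes the 43.7% criterion as given; the function name is hypothetical.

```python
def accuracy(table):
    """Share of cases on the diagonal of a square classification table."""
    correct = sum(table[i][i] for i in range(len(table)))
    total = sum(sum(row) for row in table)
    return correct / total


# Counts for the three grouped categories (ungrouped cases excluded).
original = [[43, 15, 6], [26, 30, 6], [17, 10, 9]]
cross_validated = [[43, 15, 6], [26, 30, 6], [17, 11, 8]]

criterion = 0.437  # proportional by chance criterion stated on the slide

print(f"original: {accuracy(original) * 100:.1f}%")
print(f"cross-validated: {accuracy(cross_validated) * 100:.1f}%")
print("criterion satisfied:", accuracy(cross_validated) >= criterion)
```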
SW388R7
Data Analysis &
Computers II
Slide 62
VALIDATION OF THE DISCRIMINANT ANALYSIS

                                              1 TOO     2 ABOUT   3 TOO
                           WELFARE            LITTLE    RIGHT     MUCH      Total
Original          Count    1 TOO LITTLE       43        15        6         64
                           2 ABOUT RIGHT      26        30        6         62
                           3 TOO MUCH         17        10        9         36
                           Ungrouped cases    3         3         2         8
                  %        1 TOO LITTLE       67.2      23.4      9.4       100.0
                           2 ABOUT RIGHT      41.9      48.4      9.7       100.0
                           3 TOO MUCH         47.2      27.8      25.0      100.0
                           Ungrouped cases    37.5      37.5      25.0      100.0
Cross-validated a Count    1 TOO LITTLE       43        15        6         64
                           2 ABOUT RIGHT      26        30        6         62
                           3 TOO MUCH         17        11        8         36
                  %        1 TOO LITTLE       67.2      23.4      9.4       100.0
                           2 ABOUT RIGHT      41.9      48.4      9.7       100.0
                           3 TOO MUCH         47.2      30.6      22.2      100.0
a. Cross validation is done only for those cases in the analysis. In cross validation, each case is
classified by the functions derived from all cases other than that case.

The cross-validated accuracy rate is a measure of the
generalizability of the discriminant analysis for correctly
classifying populations not included in the original model. Since
the cross-validated classification accuracy rate (50.0%) met or
exceeded the proportional by chance accuracy criteria (43.7%), this
requirement for generalizability was satisfied.
SW388R7
Answering the problem question - 1

Data Analysis &
Computers II
Slide 63
[...] about spending on welfare" [natfare] are "number of hours worked in the past week"
[hrs1], "self-employment" [wrkslf], and "highest year of school completed" [educ]. These
predictors differentiate survey respondents who thought we spend too much money on welfare
from survey respondents who thought we spend about the right amount of money on welfare
who, in turn, are differentiated from survey respondents who thought we spend too little
money on welfare.

The most important predictor of groups based on responses to opinion about spending on
welfare was number of hours worked in the past week. The second most important predictor of
groups based on responses to opinion about spending on welfare was self-employment. The
third most important predictor of groups based on responses to opinion about spending on
welfare was highest year of school completed.

Survey respondents who thought we spend about the right amount of money on welfare
worked fewer hours in the past week than survey respondents who thought we spend too
much or little money on welfare. Survey respondents who thought we spend about the right
amount of money on welfare had completed more years of school than survey respondents
who thought we spend too much or little money on welfare. Survey respondents who
thought we spend too much money on welfare were more likely to be self-employed than
survey respondents who thought we spend too little money on welfare.

The stepwise discriminant analysis included the three variables
identified as the most useful predictors.
SW388R7

Data Analysis &
Computers II
Slide 64
[...] These predictors differentiate survey respondents who thought we spend too much
money on welfare from survey respondents who thought we spend about the right amount of
money on welfare who, in turn, are differentiated from survey respondents who thought we
spend too little money on welfare.

The most important predictor of groups based on responses to opinion about spending on
welfare was number of hours worked in the past week. The second most important predictor of
groups based on responses to opinion about spending on welfare was self-employment. The
third most important predictor of groups based on responses to opinion about spending on
welfare was highest year of school completed.

Survey respondents who thought we spend about the right amount of money on welfare worked [...]

We found two statistically significant discriminant functions,
making it possible to distinguish among the three groups defined
by the dependent variable.

Moreover, the cross-validated classification accuracy surpassed
the by chance accuracy criteria, supporting the utility of the
model.
SW388R7

Data Analysis &
Computers II
Slide 65
[...] These predictors differentiate survey respondents who thought we spend too much
money on welfare from survey respondents who thought we spend about the right amount of
money on welfare who, in turn, are differentiated from survey respondents who thought we
spend too little money on welfare.

The most important predictor of groups based on responses to opinion about spending on
welfare was number of hours worked in the past week. The second most important
predictor of groups based on responses to opinion about spending on welfare was self-
employment. The third most important predictor of groups based on responses to opinion
about spending on welfare was highest year of school completed.

The order of importance matched the order of entry in the table of
"Variables Entered/Removed."
SW388R7

Data Analysis &
Computers II
Slide 66
[...] The most important predictor of groups based on responses to opinion about spending on
welfare was number of hours worked in the past week. The second most important predictor of
groups based on responses to opinion about spending on welfare was self-employment. The
third most important predictor of groups based on responses to opinion about spending on
welfare was highest year of school completed.

Survey respondents who thought we spend about the right amount of money on welfare
worked fewer hours in the past week than survey respondents who thought we spend too
much or little money on welfare. Survey respondents who thought we spend about the right
amount of money on welfare had completed more years of school than survey respondents
who thought we spend too much or little money on welfare. Survey respondents who
thought we spend too much money on welfare were more likely to be self-employed than
survey respondents who thought we spend too little money on welfare.

We verified that each statement about the relationship between
predictors and groups was correct.

1. True
2. True with caution
3. False
4. Inappropriate application of a statistic

The answer to the question is true with caution. A caution is added
because of the inclusion of ordinal level variables. A caution is
added because of a violation of discriminant analysis assumptions.
Complete discriminant analysis:
SW388R7
Data Analysis &
Computers II
Slide 67
level of measurement
The following is a guide to the decision process for answering

problems about the complete discriminant analysis:
Dependent non-metric? Independent variables metric or dichotomous?
  No  -> Inappropriate application of a statistic
  Yes -> continue
SW388R7
Data Analysis &
Computers II
Slide 68
analyzing missing data
Is any variable missing data for more than 5% of the cases in the
data set?
  No  -> Run baseline discriminant analysis
  Yes -> Create missing/valid group variable to use in t-tests with
         other metric variables in the analysis and chi-square
         tests with other nonmetric variables in the analysis.

Probability of t-tests or chi-square tests <= level of
significance?
  No  -> Run baseline discriminant analysis
  Yes -> Add caution to interpretation to require further work to
         understand pattern

Run baseline discriminant analysis, using method for including
variables identified in the research question. Record
cross-validated classification accuracy for evaluation of
transformations and removal of outliers.
SW388R7
Data Analysis &
Computers II
Slide 69
assumption of normality
Metric IV's satisfy criteria for a normal distribution?
  Yes -> continue
  No  -> Log, square root, or inverse transformation satisfies
         normality?
           Yes -> Use transformation in revised model. If more than
                  one transformation satisfies normality, use one
                  with smallest skew.
           No  -> Add caution for violation of normality

If any variables were transformed for normality or linearity,
substitute transformed variables in the detection of outliers.
SW388R7
Data Analysis &
Computers II
Slide 70
detecting outliers
Is the probability of Mahalanobis D² for any case less than or
equal to 0.001?
  Yes -> Exclude outliers from revised model
  No  -> continue

Run revised discriminant using transformed variables and omitting
outliers.
SW388R7
Data Analysis &
Computers II
Slide 71
picking discriminant model for interpretation
Cross-validated accuracy for revised discriminant analysis >
accuracy of baseline by 2% or more?
  Yes -> Pick discriminant analysis with transformations and
         omitting outliers for interpretation
  No  -> Pick baseline discriminant analysis for interpretation
SW388R7
Data Analysis &
Computers II
Slide 72
sample size
Ratio of cases to independent variables at least 5 to 1?
  No  -> Inappropriate application of a statistic
  Yes -> continue

Number of cases in smallest group greater than number of
independent variables?
  No  -> Inappropriate application of a statistic
  Yes -> continue
SW388R7
Data Analysis &
Computers II
Slide 73
assumption of equal dispersion
Probability of Box's M test less than or equal to level of
significance for assumptions?
  No  -> Pick discriminant analysis using within-groups covariance
         for interpretation
  Yes -> Re-run discriminant analysis, using separate-groups
         covariance matrices for classification

Accuracy rate at least 2% higher using separate-groups covariance
matrices?
  No  -> Pick discriminant analysis using within-groups covariance
         for interpretation
  Yes -> Pick discriminant analysis using separate-groups
         covariance for interpretation
SW388R7
Data Analysis &
Computers II
Slide 74
usable discriminant model
Sufficient statistically significant functions to distinguish DV
groups?
  No  -> False
  Yes -> continue

Tolerance for all IV's greater than 0.10, indicating no
multicollinearity?
  No  -> False
  Yes -> continue
SW388R7
Data Analysis &
Computers II
Slide 75
relationships between IV's and DV
Stepwise method of entry used to include independent variables?
  No  -> go to the check of relationships
  Yes -> Entry order of variables interpreted correctly?
           No  -> False
           Yes -> go to the check of relationships

Relationships between individual IVs and DV groups interpreted
correctly?
  No  -> False
  Yes -> continue
SW388R7
Data Analysis &
Computers II
Slide 76
classification accuracy
Cross-validated accuracy is 25% higher than proportional by chance
accuracy rate?
  No  -> False
  Yes -> continue
SW388R7
Data Analysis &
Computers II
Slide 77
adding cautions to solution
Satisfies preferred ratio of cases to IV's of 20 to 1?
  No  -> True with caution
  Yes -> continue

Satisfies preferred DV group minimum size of 20 cases?
  No  -> True with caution
  Yes -> continue

DV is non-metric level and IVs are interval level or dichotomous
(not ordinal)?
  No  -> True with caution
  Yes -> True
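
The decision guide on slides 67 through 77 resolves to one of four answers, checked in order of severity. The sketch below is one possible encoding, not part of the course materials: the function name and the flag labels are hypothetical, and each flag is True when the corresponding check failed or a caution was raised.

```python
def answer_problem(inappropriate, false_flags, cautions):
    """Resolve the final answer from three groups of failed checks.

    Each argument maps a hypothetical label to True when that check failed.
    """
    if any(inappropriate.values()):
        return "Inappropriate application of a statistic"
    if any(false_flags.values()):
        return "False"
    if any(cautions.values()):
        return "True with caution"
    return "True"


# Flags for the worked problem, following the conclusions on the slides.
result = answer_problem(
    inappropriate={"level_of_measurement": False, "sample_size": False},
    false_flags={"too_few_functions": False, "multicollinearity": False,
                 "relationships_wrong": False, "accuracy_below_criterion": False},
    cautions={"ordinal_variables": True, "assumption_violation": True},
)
print(result)  # prints True with caution
```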
