
SW388R7

Strategy for Complete Discriminant Analysis


Data Analysis &
Computers II

Slide 1

Assumptions of normality, linearity, and homogeneity of variance

Outliers

Multicollinearity

Validation

Sample problem

Steps in solving problems


Assumptions of normality, linearity, and homogeneity of variance

• The ability of discriminant analysis to extract discriminant functions
that are capable of producing accurate classifications is enhanced
when the assumptions of normality, linearity, and homogeneity of
variance are satisfied.

• We will use the script for testing normality, and test substituting
the log, square root, or inverse transformation when one of them induces
normality in a variable that fails to satisfy the criteria for normality.

• We can compare the accuracy rate of a model using transformed
variables to that of a model without them, to evaluate whether the
improvement gained from transformed variables is sufficient to justify
the interpretational burden of explaining transformations.

Assumption of linearity in discriminant analysis

• Since the dependent variable is non-metric in discriminant analysis,
there is no linear relationship between the dependent variable and
an independent variable.

• In discriminant analysis, the assumption of linearity applies to the
relationships between pairs of independent variables. To identify
violations of linearity, each metric independent variable would have to
be tested against all others.

• Since non-linearity only reduces the power to detect relationships, the
general advice is to attend to it only when we know that a variable in
our analysis consistently demonstrates non-linear relationships with
other independent variables.

• We will not test for linearity in our problems.


Assumption of homogeneity of variance

• The assumption of homogeneity of variance is particularly important in
the classification stage of discriminant analysis.

• If one of the groups defined by the dependent variable has greater
dispersion than the others, cases will tend to be over-classified into it.

• Homogeneity of variance is tested with Box's M test, which tests the
null hypothesis that the group variance-covariance matrices are equal.
If we fail to reject this null hypothesis and conclude that the
variances are equal, we use the SPSS default of a pooled covariance
matrix in classification.

• If we reject the null hypothesis and conclude that the variances are
heterogeneous, we substitute separate covariance matrices in the
classification, and evaluate whether or not our classification accuracy
is improved.
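
To make the comparison of group covariance matrices concrete, here is a minimal sketch of the Box's M statistic, M = (N - g) ln|S_pooled| - sum((n_i - 1) ln|S_i|). It is restricted to two predictors so that the determinant can be written out by hand, it uses made-up data, and it stops at the statistic itself; the significance test that SPSS reports is not reproduced. The function names are illustrative only.

```python
import math

def cov2(rows):
    """Sample covariance matrix (n - 1 denominator) for (x, y) rows."""
    n = len(rows)
    mx = sum(x for x, _ in rows) / n
    my = sum(y for _, y in rows) / n
    sxx = sum((x - mx) ** 2 for x, _ in rows) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in rows) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in rows) / (n - 1)
    return [[sxx, sxy], [sxy, syy]]

def det2(m):
    """Determinant of a 2 x 2 matrix."""
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]

def box_m(groups):
    """Box's M for a list of groups, each a list of (x, y) rows."""
    g = len(groups)
    n_total = sum(len(rows) for rows in groups)
    covs = [cov2(rows) for rows in groups]
    # Pooled matrix: group matrices weighted by their degrees of freedom.
    pooled = [[0.0, 0.0], [0.0, 0.0]]
    for rows, s in zip(groups, covs):
        weight = (len(rows) - 1) / (n_total - g)
        for i in range(2):
            for j in range(2):
                pooled[i][j] += weight * s[i][j]
    m = (n_total - g) * math.log(det2(pooled))
    m -= sum((len(rows) - 1) * math.log(det2(s))
             for rows, s in zip(groups, covs))
    return m

# Hypothetical data, not taken from GSS2000R.
tight = [(1, 2), (2, 1), (3, 4), (4, 3), (5, 5)]
shifted = [(x + 10, y + 10) for x, y in tight]  # same dispersion, new centre
spread = [(3 * x, 3 * y) for x, y in tight]     # three times the dispersion

print(round(abs(box_m([tight, shifted])), 6))   # equal dispersion
print(round(box_m([tight, spread]), 3))         # unequal dispersion
```

Groups that differ only in location give M of zero, while a group with greater dispersion pushes M above zero; a large M is what leads to the separate-covariance classification described above.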

Detecting outliers in discriminant analysis - 1

• For multiple regression, we used z scores, studentized residuals, and
Mahalanobis distance as criteria for omitting a case from the analysis
as an outlier.

• Since the independent variables in a discriminant analysis are either
metric or dichotomous, Mahalanobis distance can be used to detect a
case that is an outlier for the combination of independent variables.

• Tabachnick suggests eliminating cases that are multivariate outliers
using Mahalanobis distance. In the output for discriminant analysis,
SPSS provides the Mahalanobis distance for the two groups that a case
is most likely to belong to. SPSS suggests that this statistic can be
used to detect outliers. However, this is not useful for problems with
more than two groups defined by the dependent variable, and
computing Mahalanobis distance separately for each group does not
yield the same values as the SPSS output.

• In discriminant analysis, the group membership for each case is
predicted. Cases for which there is an erroneous prediction could be
considered errors, and some of these could probably be labeled
outliers.

Detecting outliers in discriminant analysis - 2

• The strategy that we will use for detecting outliers is testing each case
as a multivariate outlier, and omitting those cases where the
probability of the Mahalanobis distance is less than or equal to 0.001.

• The script for detecting outliers computes the Mahalanobis distance
and its probability for each case. We will ignore the z score and the
studentized residual.
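
The outlier rule can be sketched in a few lines. This is not the course script: it assumes that the distance is measured from the overall centroid with the sample covariance matrix and that its probability comes from a chi-square distribution whose degrees of freedom equal the number of predictors. To stay within the standard library it handles only two predictors, for which the chi-square upper tail is exactly exp(-D²/2); the data and function names are made up.

```python
import math

def mahalanobis_sq(rows):
    """Squared Mahalanobis distance of each (x, y) row from the centroid."""
    n = len(rows)
    mx = sum(x for x, _ in rows) / n
    my = sum(y for _, y in rows) / n
    sxx = sum((x - mx) ** 2 for x, _ in rows) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in rows) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in rows) / (n - 1)
    det = sxx * syy - sxy ** 2
    result = []
    for x, y in rows:
        dx, dy = x - mx, y - my
        # d' S^-1 d written out for a 2 x 2 covariance matrix
        result.append((syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det)
    return result

def chi2_upper_tail_df2(value):
    """Upper-tail chi-square probability for 2 degrees of freedom."""
    return math.exp(-value / 2)

# Hypothetical (hours worked, years of school) pairs: 35 ordinary cases
# followed by one extreme case.
rows = [(40 + i % 7, 12 + i % 5) for i in range(35)] + [(90, 2)]

distances = mahalanobis_sq(rows)
outliers = [i for i, d2 in enumerate(distances)
            if chi2_upper_tail_df2(d2) <= 0.001]
print(outliers)
```

With these made-up values only the final case is flagged. A problem with four predictors would need the chi-square tail for 4 degrees of freedom instead.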

Multicollinearity

• Multicollinearity has the same effect in discriminant analysis
that it does in multiple regression, i.e. the importance of an
independent variable will be undervalued because it has a very
strong relationship to another independent variable or
combination of independent variables.
• As in multiple regression, multicollinearity in discriminant
analysis is identified by examining tolerance values.
• While tolerance is routinely included in the output for the
stepwise method for including variables, it is not included for
simultaneous entry of variables. If a tolerance problem occurs
in a simultaneous entry problem, SPSS will include a table titled
"Variables Failing Tolerance Test."
• We should not attempt to interpret an analysis with a
multicollinearity problem until we have resolved the problem
by removing or combining the problematic variables.
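
As a rough illustration of what a tolerance value measures, the sketch below uses the general definition, tolerance = 1 - R², where R² comes from predicting one independent variable from the others. With only two predictors R² is simply the squared correlation. The sketch ignores the grouping variable, so it should not be expected to reproduce the figures SPSS prints for a discriminant analysis; the data and function names are hypothetical.

```python
def pearson_r(xs, ys):
    """Pearson correlation between two equally long lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

def tolerance_two_predictors(xs, ys):
    """Tolerance = 1 - R^2; with two predictors R^2 equals r^2."""
    return 1 - pearson_r(xs, ys) ** 2

# Hypothetical values, not taken from GSS2000R.
educ = [8, 10, 12, 12, 14, 16, 16, 18]
income = [5, 8, 11, 12, 14, 17, 18, 21]    # almost a linear function of educ
hours = [40, 35, 50, 20, 45, 38, 60, 30]   # weakly related to educ

print(round(tolerance_two_predictors(educ, income), 3))  # near 0: a problem
print(round(tolerance_two_predictors(educ, hours), 3))   # near 1: no problem
```

A tolerance close to zero means the variable carries almost no information that the other predictor does not already supply.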

Validation

• The primary criteria for a successful discriminant analysis are:
  ▪ the existence of sufficient statistically significant
  discriminant functions to distinguish among the groups
  defined by the dependent variable, and
  ▪ an accuracy rate that substantially improves on the accuracy
  rate obtainable by chance alone.

• SPSS calculates a cross-validated accuracy rate for the analysis,
using a jackknife or leave-one-out strategy. It computes the
discriminant analysis once for each case in the sample, leaving
that case out of the calculations for the discriminant model. The
discriminant model is then used to classify the case that was
held out. Thus the bias toward an optimistically high accuracy
rate is avoided.

• We will use this cross-validation in our problems rather than
doing a separate 75-25% cross-validation.
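
The leave-one-out idea can be shown with a toy example. The sketch below does not fit discriminant functions; as a stand-in it classifies a case into the group whose mean on a single predictor is closest. The point is only the hold-out loop: the model is re-estimated without the case that is about to be classified. The data and function names are made up.

```python
def nearest_mean_label(train, value):
    """Classify value into the group whose training mean is closest."""
    sums, counts = {}, {}
    for x, label in train:
        sums[label] = sums.get(label, 0.0) + x
        counts[label] = counts.get(label, 0) + 1
    return min(sums, key=lambda lab: abs(value - sums[lab] / counts[lab]))

def resubstitution_accuracy(cases):
    """Accuracy when every case also helped to build the model."""
    hits = sum(nearest_mean_label(cases, x) == label for x, label in cases)
    return hits / len(cases)

def leave_one_out_accuracy(cases):
    """Accuracy when each case is classified by a model built without it."""
    hits = 0
    for i, (x, label) in enumerate(cases):
        train = cases[:i] + cases[i + 1:]   # hold case i out
        hits += nearest_mean_label(train, x) == label
    return hits / len(cases)

# Hypothetical scores for two groups.
cases = [(1, "a"), (2, "a"), (3, "a"), (6, "a"),
         (7, "b"), (9, "b"), (10, "b"), (13, "b")]

print(resubstitution_accuracy(cases))
print(leave_one_out_accuracy(cases))
```

In this example the case scored 6 is classified correctly only when it is allowed to pull its own group mean toward itself, so the leave-one-out rate is lower than the resubstitution rate.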

Overall strategy for solving problems

1. Run a baseline discriminant analysis using the method for including
variables implied by the problem statement to find the baseline
cross-validated accuracy rate for the model.
2. Test for useful transformations to improve normality.
3. Substitute transformed variables and check for outliers.
4. If the cross-validated accuracy rate from the discriminant analysis
using transformed variables and omitting outliers is at least 2% better
than the baseline cross-validated accuracy rate, select it for
interpretation; otherwise select the baseline model.
5. If the Box’s M statistic is statistically significant, we violate the
assumption of homogeneity of variance and re-run the analysis using
separate covariance matrices for classification. If the accuracy rate
increases by more than 2%, we interpret this model; otherwise we return
to the model using the pooled covariance matrix.
6. If the cross-validated accuracy rate is 25% or more higher than the
proportional by-chance accuracy rate, interpret the selected
discriminant model:
  ▪ Number of functions and importance of predictors
  ▪ Role of individual variables on functions distinguishing among groups
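
The two numeric decision rules in steps 4 and 6 can be written down as small helper functions. This is only a sketch under two assumptions that the slides do not spell out: "2% better" is read as two percentage points, and "25% or more higher" is read as at least 1.25 times the by-chance rate. The function names and the accuracy figures in the example calls are hypothetical.

```python
def select_model(baseline_acc, revised_acc, min_gain=0.02):
    """Step 4: keep the revised model only if it gains at least min_gain."""
    # round() guards against floating-point noise in the subtraction
    gain = round(revised_acc - baseline_acc, 10)
    return "revised" if gain >= min_gain else "baseline"

def is_interpretable(cv_acc, chance_acc, margin=0.25):
    """Step 6: accuracy must reach (1 + margin) times the by-chance rate."""
    return cv_acc >= (1 + margin) * chance_acc

# Hypothetical accuracy rates expressed as proportions.
print(select_model(0.50, 0.53))       # gain of 3 points
print(select_model(0.50, 0.51))       # gain of 1 point
print(is_interpretable(0.50, 0.35))   # criterion is 0.4375
print(is_interpretable(0.40, 0.35))
```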

Problem 1

In the dataset GSS2000R, is the following statement true, false, or an incorrect application of a statistic? Use a
level of significance of 0.05 for the statistical analysis. Use a level of significance of 0.01 for evaluating missing
data and assumptions.

From the list of variables "number of hours worked in the past week" [hrs1], "self-employment" [wrkslf],
"highest year of school completed" [educ], and "income" [rincom98], the most useful predictors for
distinguishing among groups based on responses to "opinion about spending on welfare" [natfare] are "number
of hours worked in the past week" [hrs1], "self-employment" [wrkslf], and "highest year of school completed"
[educ]. These predictors differentiate survey respondents who thought we spend too little money on welfare
from survey respondents who thought we spend about the right amount of money on welfare who, in turn, are
differentiated from survey respondents who thought we spend too much money on welfare.

The most important predictor of groups based on responses to opinion about spending on welfare was number
of hours worked in the past week. The second most important predictor of groups based on responses to
opinion about spending on welfare was self-employment. The third most important predictor of groups based
on responses to opinion about spending on welfare was highest year of school completed.

Survey respondents who thought we spend about the right amount of money on welfare worked fewer hours in
the past week than survey respondents who thought we spend too little or too much money on welfare. Survey
respondents who thought we spend about the right amount of money on welfare had completed more years of
school than survey respondents who thought we spend too little or too much money on welfare. Survey
respondents who thought we spend too much money on welfare were more likely to be self-employed than
survey respondents who thought we spend too little money on welfare.

1. True
2. True with caution
3. False
4. Inappropriate application of a statistic

Dissecting problem 1 - 1

The problem may give us different levels of significance for the
analysis.

In this problem, we are told to use 0.05 as alpha for the discriminant
analysis, but 0.01 for testing assumptions.

Dissecting problem 1 - 2

The variables listed first in the problem statement are the independent
variables (IVs): "number of hours worked in the past week" [hrs1],
"self-employment" [wrkslf], "highest year of school completed" [educ],
and "income" [rincom98].

The variable used to define groups is the dependent variable (DV):
"opinion about spending on welfare" [natfare].

When a problem asks us to identify the best or most useful predictors
from a list of independent variables, we do stepwise discriminant
analysis.

Dissecting problem 1 - 3

The problem identifies three groups for the dependent variable:
• survey respondents who thought we spend too much money on welfare
• survey respondents who thought we spend about the right amount of
money on welfare
• survey respondents who thought we spend too little money on welfare.

To distinguish among three groups, the analysis will be required to find
two statistically significant discriminant functions.

Dissecting problem 1 - 4

In a stepwise analysis, we only interpret the independent variables that
are entered in the stepwise analysis.

The importance of individual predictors is based on order of entry in
the analysis.

Dissecting problem 1 - 5

The specific relationships listed in the problem indicate how the
independent variable relates to groups of the dependent variable, e.g.,
the mean for hours worked in the past week will be lower for respondents
who think we spend the right amount of money versus respondents who
think we spend too much or too little.

In order for a stepwise analysis to be true, we must have enough
statistically significant functions to distinguish among the groups,
the order of entry must be correct, and each significant relationship
must be interpreted correctly.

LEVEL OF MEASUREMENT - 1

Discriminant analysis requires that the dependent variable be non-metric
and the independent variables be metric or dichotomous. "Opinion about
spending on welfare" [natfare] is an ordinal level variable, which
satisfies the level of measurement requirement.

It contains three categories: survey respondents who thought we spend
too much money on welfare, survey respondents who thought we spend
about the right amount of money on welfare, and survey respondents who
thought we spend too little money on welfare.

LEVEL OF MEASUREMENT - 2

"Number of hours worked in the past week" [hrs1] and "highest year of
school completed" [educ] are interval level variables, which satisfy
the level of measurement requirements for discriminant analysis.

"Income" [rincom98] is an ordinal level variable. If we follow the
convention of treating ordinal level variables as metric variables, the
level of measurement requirement for discriminant analysis is satisfied.
Since some data analysts do not agree with this convention, a note of
caution should be included in our interpretation.

"Self-employment" [wrkslf] is a dichotomous or dummy-coded nominal
variable which may be included in discriminant analysis.

PATTERNS OF MISSING DATA - 1

Run the script to check missing data. Move the variables included in the
analysis, mark the option for missing data, specify that the dependent
variable is nonmetric, and click the OK button.

Be sure to specify that the dependent variable is nonmetric.

PATTERNS OF MISSING DATA - 2

Statistics

                                      N Valid   N Missing
WELFARE                                 253        17
NUMBER OF HOURS WORKED LAST WEEK        176        94
HIGHEST YEAR OF SCHOOL COMPLETED        269         1
RESPONDENTS INCOME                      168       102
R SELF-EMP OR WORKS FOR SOMEBODY        250        20

Several variables were missing data for more than 5% of the
cases in the data set: "opinion about spending on welfare"
[natfare] was missing data for 6.3% of the cases (17 of 270
cases); "number of hours worked in the past week" [hrs1] was
missing data for 34.8% of the cases (94 of 270 cases);
"self-employment" [wrkslf] was missing data for 7.4% of the cases
(20 of 270 cases); and "income" [rincom98] was missing data
for 37.8% of the cases in the data set (102 of 270 cases).
Missing/valid dichotomous variables were created for these
variables to test whether the group of cases with missing data
differed significantly from the group of cases with valid data.
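
The percentages quoted above follow directly from the missing counts in the Statistics table. A short sketch of the arithmetic, using those counts and the 5% threshold from the text (the variable names in the code are simply the SPSS variable names):

```python
n_cases = 270

# Missing counts copied from the Statistics table.
missing = {"natfare": 17, "hrs1": 94, "educ": 1, "rincom98": 102, "wrkslf": 20}

pct_missing = {name: 100 * count / n_cases for name, count in missing.items()}

# Variables missing data for more than 5% of the cases.
flagged = [name for name, pct in pct_missing.items() if pct > 5]

for name, pct in pct_missing.items():
    print(f"{name}: {pct:.1f}% missing")
print(flagged)
```

Only "highest year of school completed" [educ], with one missing case, stays under the threshold.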

PATTERNS OF MISSING DATA - 3

There were significant differences in the statistical tests comparing
cases with missing data to cases with valid data.

Cases who had missing data for the variable "number of hours
worked in the past week" [hrs1] had an average score on the
variable "highest year of school completed" [educ] that was 1.87
units lower than the average for cases who had valid data (t=-5.194,
p<0.001) and had an average score on the variable "income"
[rincom98] that was 5.32 units lower than the average for cases who
had valid data (t=-4.758, p<0.001).

PATTERNS OF MISSING DATA - 4

Since there were significant differences in the statistical tests
comparing cases with missing data to cases with valid data, a caution
was added to the interpretation of any findings, pending further
analysis of the missing data pattern.

Cases who had missing data for the variable "income" [rincom98] had an
average score on the variable "highest year of school completed" [educ]
that was 2.27 units lower than the average for cases who had valid data
(t=-6.287, p<0.001).

The baseline discriminant analysis

We begin our analysis by running a stepwise discriminant analysis with
natfare as the dependent variable and hrs1, wrkslf, educ, and rincom98
as the independent variables.

Select the Classify | Discriminant… command from the Analyze menu.

Selecting the dependent variable

First, highlight the dependent variable natfare in the list of
variables.

Second, click on the right arrow button to move the dependent variable
to the Grouping Variable text box.

Defining the group values

When SPSS moves the dependent variable to the Grouping Variable text
box, it puts two question marks in parentheses after the variable name.
This is a reminder that we have to enter the numbers that represent the
groups we want to include in the analysis.

To specify the group numbers, click on the Define Range… button.

Completing the range of group values

The value labels for natfare show three categories:
1 = TOO LITTLE
2 = ABOUT RIGHT
3 = TOO MUCH

The range of values that we need to enter goes from 1 as the minimum
to 3 as the maximum.

First, type in 1 in the Minimum text box.

Second, type in 3 in the Maximum text box.

Third, click on the Continue button to close the dialog box.

Note: if we enter the wrong range of group numbers, e.g., 1 to 2 instead
of 1 to 3, SPSS will only include groups 1 and 2 in the analysis.

Specifying the method for including variables

SPSS provides us with two methods for including variables: entering all
of the independent variables at one time, or a stepwise method that
uses a statistical test to determine the order in which variables are
included.

Since the problem asks about the importance of the best subset of
predictors, we mark the option button to Use stepwise method.

Requesting statistics for the output

Click on the Statistics… button to select statistics we will need for
the analysis.

Specifying statistical output

First, mark the Means checkbox on the Descriptives panel. We will use
the group means in our interpretation.

Second, mark the Univariate ANOVAs checkbox on the Descriptives panel.
Perusing these tests suggests which variables might be useful
discriminators.

Third, mark the Box’s M checkbox. Box’s M statistic evaluates conformity
to the assumption of homogeneity of group variances.

Fourth, click on the Continue button to close the dialog box.

Specifying details for the stepwise method

Click on the Method… button to specify the statistical criteria to use
for including variables.

Details for the stepwise method

First, mark the Mahalanobis distance option button on the Method panel.

Second, mark the Summary of steps checkbox to produce a summary table
when a new variable is added.

Third, click on the option button Use probability of F so that we can
incorporate the level of significance specified in the problem.

Fourth, type the level of significance in the Entry text box. The
Removal value is twice as large as the entry value.

Fifth, click on the Continue button to close the dialog box.

Specifying details for classification

Click on the Classify… button to specify details for the classification
phase of the analysis.

Details for classification - 1

First, mark the option button to Compute from group sizes on the Prior
Probabilities panel. This incorporates the size of the groups defined
by the dependent variable into the classification of cases using the
discriminant functions.

Second, mark the Casewise results checkbox on the Display panel to
include classification details for each case in the output.

Third, mark the Summary table checkbox to include summary tables
comparing actual and predicted classification.

Details for classification - 2

Fourth, mark the Leave-one-out classification checkbox to request SPSS
to include a cross-validated classification in the output. This option
produces a less biased estimate of classification accuracy by
sequentially holding each case out of the calculations for the
discriminant functions, and using the derived functions to classify the
case held out.

Details for classification - 3

Fifth, accept the default Within-groups option button on the Use
Covariance Matrix panel. The covariance matrices are the measure of the
dispersion in the groups defined by the dependent variable. If we fail
the homogeneity of group variances test (Box’s M), our option is to use
the Separate-groups covariance matrices in classification.

Sixth, mark the Combined-groups checkbox on the Plots panel to obtain a
visual plot of the relationship between functions and groups defined by
the dependent variable.

Seventh, click on the Continue button to close the dialog box.

Completing the discriminant analysis request

Click on the OK button to request the output for the discriminant
analysis.

Classification accuracy before transformations or removing outliers

Classification Results (b, c)

                                          Predicted Group Membership
                       WELFARE               1       2       3    Total
Original         Count 1                    43      15       6       64
                       2                    26      30       6       62
                       3                    17      10       9       36
                       Ungrouped cases       3       3       2        8
                 %     1                  67.2    23.4     9.4    100.0
                       2                  41.9    48.4     9.7    100.0
                       3                  47.2    27.8    25.0    100.0
                       Ungrouped cases    37.5    37.5    25.0    100.0
Cross-validated  Count 1                    43      15       6       64
(a)                    2                    26      30       6       62
                       3                    17      11       8       36
                 %     1                  67.2    23.4     9.4    100.0
                       2                  41.9    48.4     9.7    100.0
                       3                  47.2    30.6    22.2    100.0

a. Cross validation is done only for those cases in the analysis. In cross validation, each case
is classified by the functions derived from all cases other than that case.
b. 50.6% of original grouped cases correctly classified.
c. 50.0% of cross-validated grouped cases correctly classified.

Prior to any transformations of variables to satisfy the assumptions of
discriminant analysis or removal of outliers, the cross-validated
accuracy rate was 50.0%.

This accuracy rate is the benchmark that we will use to evaluate the
utility of transformations and the elimination of outliers.
SW388R7

ASSUMPTION OF NORMALITY
Data Analysis &
Computers II

Slide 37

First, move the variables to the


list boxes based on the role that
the variable plays in the analysis
and its level of measurement.

Second, click on the Normality option


button to request that SPSS produce
the output needed to evaluate the
assumption of normality.

Third, mark the checkboxes
for the transformations that
we want to test in evaluating
the assumption.

Fourth, mark the
dependent variable
as nonmetric.

Fifth, click on the
OK button to
produce the output.
SW388R7

Normality of independent variable: highest year of school completed

Data Analysis &
Computers II

Slide 38

Descriptives: HIGHEST YEAR OF SCHOOL COMPLETED

                                                Statistic   Std. Error
Mean                                                13.12         .179
95% Confidence Interval for Mean   Lower Bound      12.77
                                   Upper Bound      13.47
5% Trimmed Mean                                     13.14
Median                                              13.00
Variance                                            8.583
Std. Deviation                                      2.930
Minimum                                                 2
Maximum                                                20
Range                                                  18
Interquartile Range                                  3.00
Skewness                                            -.137         .149
Kurtosis                                            1.246         .296

The independent variable "highest year of


school completed" [educ] does not satisfy the
criteria for a normal distribution.

The skewness (-0.137) fell between -1.0 and


+1.0, but the kurtosis (1.246) fell outside the
range from -1.0 to +1.0.
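
The rule of thumb applied here can be written as a short check. This is only a sketch: the function name is hypothetical, and treating the bounds -1.0 and +1.0 as inclusive is an assumption, since the slides only say "between". The skewness and kurtosis values are the ones reported in the Descriptives tables for the three metric predictors.

```python
def satisfies_normality(skewness, kurtosis, limit=1.0):
    """Rule of thumb: skewness and kurtosis must both lie in [-limit, +limit].

    Inclusive bounds are an assumption; the slides only say "between".
    """
    return -limit <= skewness <= limit and -limit <= kurtosis <= limit


# (skewness, kurtosis) as reported in the SPSS Descriptives output.
reported = {
    "educ": (-0.137, 1.246),
    "hrs1": (-0.324, 0.935),
    "rincom98": (-0.686, -0.253),
}

results = {name: satisfies_normality(s, k) for name, (s, k) in reported.items()}
for name, ok in results.items():
    print(name, "satisfies the criteria" if ok else "does not satisfy the criteria")
```

Only [educ] fails, and it fails on kurtosis alone, which matches the conclusion stated above.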
SW388R7

Normality of independent variable: highest year of school completed

Data Analysis &
Computers II

Slide 39

Neither the logarithmic, the square root,


nor the inverse transformation normalizes
the variable.

A caution was added to the findings.


SW388R7

Normality of independent variable: number of hours worked in the past week

Data Analysis &
Computers II

Slide 40

Descriptives: NUMBER OF HOURS WORKED LAST WEEK

                                                Statistic   Std. Error
Mean                                                40.99         .958
95% Confidence Interval for Mean   Lower Bound      39.10
                                   Upper Bound      42.88
5% Trimmed Mean                                     41.21
Median                                              40.00
Variance                                          161.491
Std. Deviation                                     12.708
Minimum                                                 4
Maximum                                                80
Range                                                  76
Interquartile Range                                 10.00
Skewness                                            -.324         .183
Kurtosis                                             .935         .364

The variable "number of hours worked in the


past week" [hrs1] satisfies the criteria for a
normal distribution. The skewness (-0.324)
and kurtosis (0.935) were both between -1.0
and +1.0.
SW388R7

Normality of independent variable: income

Data Analysis &
Computers II

Slide 41

Descriptives: RESPONDENTS INCOME

                                                Statistic   Std. Error
Mean                                                13.35         .419
95% Confidence Interval for Mean   Lower Bound      12.52
                                   Upper Bound      14.18
5% Trimmed Mean                                     13.54
Median                                              15.00
Variance                                           29.535
Std. Deviation                                      5.435
Minimum                                                 1
Maximum                                                23
Range                                                  22
Interquartile Range                                  8.00
Skewness                                            -.686         .187
Kurtosis                                            -.253         .373

The variable "income" [rincom98] satisfies the criteria for a
normal distribution. The skewness (-0.686) and kurtosis (-0.253)
were both between -1.0 and +1.0.
SW388R7

Using the script to detect outliers


Data Analysis &
Computers II

Slide 42

Move the variables to the list


boxes for the dependent and
independent variables,
including transformed
variables that we have decided
to use.

Click on the Detect outliers
option button to request that
SPSS create the variables
needed to detect outliers.

Note: when detecting
outliers, clear the check box
for deleting variables to
keep SPSS from deleting
the variables immediately
after it creates them.

Click on the OK
button to produce
the output.
SW388R7

Outliers in the data set


Data Analysis &
Computers II

Slide 43

A case can be characterized as an outlier if the


probability associated with the Mahalanobis D²
for the combination of independent variables is
less than 0.001.

The smallest probability for any case in the


data set is larger than 0.001. There were no
cases that could be classified as outliers.
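
A minimal sketch of the outlier rule follows. It assumes that the probability for Mahalanobis D² is the upper-tail chi-square probability with degrees of freedom equal to the number of independent variables. The function names and the example D² values are hypothetical and are not taken from the sample data set. The chi-square tail is computed with the closed form for integer degrees of freedom, so only the standard library is needed.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable with integer df."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum_{k=0}^{df/2 - 1} (x/2)^k / k!
        total = sum(half ** k / math.factorial(k) for k in range(df // 2))
        return math.exp(-half) * total
    # Odd df: erfc(sqrt(x/2)) plus the half-integer terms of the same series.
    total = sum(half ** (k - 0.5) / math.gamma(k + 0.5)
                for k in range(1, (df - 1) // 2 + 1))
    return math.erfc(math.sqrt(half)) + math.exp(-half) * total


def flag_outliers(d2_values, n_predictors, cutoff=0.001):
    """Return the positions of cases whose D-squared probability is below cutoff."""
    return [i for i, d2 in enumerate(d2_values)
            if chi2_sf(d2, n_predictors) < cutoff]


# Hypothetical D-squared values for three cases with 4 independent variables.
example_d2 = [2.1, 7.8, 19.5]
print(flag_outliers(example_d2, n_predictors=4))
```

In the sample problem no case falls below the 0.001 cutoff, so the list returned for the real data would be empty.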
SW388R7

Omitting outliers
Data Analysis &
Computers II

Slide 44

In this analysis, there were


no outliers. Had there been
outliers, we would have used
a select if command with this
formula to exclude them
from the analysis.
SW388R7

Classification accuracy using transformations and excluding outliers

Data Analysis &
Computers II

Slide 45

Had we used a transformation for normality or identified outliers, we
would have substituted the transformed variable in the list of
independent variables, selected out cases that were outliers, and run
the revised discriminant analysis model.

If the cross-validated accuracy rate for the revised model were more
accurate by 2% or more, we would select the revised model for
interpretation.

Classification Results (b, c)

                                           Predicted Group Membership
                          WELFARE                1      2      3   Total
Original           Count  1                     43     15      6      64
                          2                     26     30      6      62
                          3                     17     10      9      36
                          Ungrouped cases        3      3      2       8
                   %      1                   67.2   23.4    9.4   100.0
                          2                   41.9   48.4    9.7   100.0
                          3                   47.2   27.8   25.0   100.0
                          Ungrouped cases     37.5   37.5   25.0   100.0
Cross-validated a  Count  1                     43     15      6      64
                          2                     26     30      6      62
                          3                     17     11      8      36
                   %      1                   67.2   23.4    9.4   100.0
                          2                   41.9   48.4    9.7   100.0
                          3                   47.2   30.6   22.2   100.0
a. Cross validation is done only for those cases in the analysis. In cross validation, each case
is classified by the functions derived from all cases other than that case.
b. 50.6% of original grouped cases correctly classified.
c. 50.0% of cross-validated grouped cases correctly classified.

Since no useful transformations of variables were identified in the
evaluation of normality for discriminant analysis and no outliers were
identified as candidates for removal from the analysis, the baseline
discriminant analysis with all cases and the original form of all
variables will be interpreted.
SW388R7

SAMPLE SIZE - 1
Data Analysis &
Computers II

Slide 46

Analysis Case Processing Summary

Unweighted Cases                                                  N   Percent
Valid                                                           138      51.1
Excluded  Missing or out-of-range group codes                     7       2.6
          At least one missing discriminating variable          115      42.6
          Both missing or out-of-range group codes and
          at least one missing discriminating variable           10       3.7
          Total                                                 132      48.9
Total                                                           270     100.0

The minimum ratio of valid cases to independent variables for
discriminant analysis is 5 to 1, with a preferred ratio of 20 to 1.
In this analysis, there are 138 valid cases and 4 independent
variables.

The ratio of cases to independent variables is 34.5 to 1, which
satisfies the minimum requirement. In addition, the ratio of
34.5 to 1 satisfies the preferred ratio of 20 to 1.
SW388R7

SAMPLE SIZE - 2
Data Analysis &
Computers II

Slide 47

Prior Probabilities for Groups

                    Cases Used in Analysis
WELFARE    Prior    Unweighted    Weighted
1           .406            56      56.000
2           .362            50      50.000
3           .232            32      32.000
Total      1.000           138     138.000

In addition to the requirement for the ratio of cases to independent
variables, discriminant analysis requires that there be a minimum
number of cases in the smallest group defined by the dependent variable.
The number of cases in the smallest group must be larger than the number
of independent variables, and preferably contain 20 or more cases.

The number of cases in the smallest


group in this problem is 32, which is
larger than the number of
independent variables (4), satisfying
the minimum requirement. In
addition, the number of cases in the
smallest group satisfies the preferred
minimum of 20 cases.
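
Both sample size requirements can be checked together. The sketch below uses the counts reported in the SPSS output (138 valid cases, 4 independent variables, groups of 56, 50, and 32); the function name and the dictionary keys are hypothetical.

```python
def check_sample_size(valid_cases, n_predictors, group_sizes):
    """Apply the case-to-variable ratio and smallest-group rules from the slides."""
    ratio = valid_cases / n_predictors
    smallest = min(group_sizes)
    return {
        "ratio": ratio,
        "meets_minimum_ratio": ratio >= 5,        # minimum 5 to 1
        "meets_preferred_ratio": ratio >= 20,     # preferred 20 to 1
        "smallest_group": smallest,
        "meets_minimum_group": smallest > n_predictors,
        "meets_preferred_group": smallest >= 20,  # preferably 20 or more cases
    }


summary = check_sample_size(valid_cases=138, n_predictors=4,
                            group_sizes=[56, 50, 32])
for key, value in summary.items():
    print(key, value)
```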
SW388R7

ASSUMPTION OF EQUAL DISPERSION FOR DEPENDENT VARIABLE GROUPS

Data Analysis &
Computers II

Slide 48

The assumption of equal


dispersion for groups defined by
the dependent variable only
affects the classification phase of
discriminant analysis, and so is
not evaluated until we are
determining the final accuracy
rate of the model.

Box's M test evaluated the


homogeneity of dispersion
matrices across the subgroups of
the dependent variable. The null
hypothesis is that the dispersion
matrices are homogeneous. If
the analysis fails this test, we
request the use of separate
group dispersion matrices in the
classification phase of the
discriminant analysis to see if
this improves our accuracy rate.
SW388R7

ASSUMPTION OF EQUAL DISPERSION FOR DEPENDENT VARIABLE GROUPS

Data Analysis &
Computers II

Slide 49

In this analysis, Box's M statistic


had a value of 19.386 with a
probability of 0.096. Since the
probability for Box's M is greater
than the level of significance for
testing assumptions (0.01), the
null hypothesis is not rejected
and the assumption of equal
dispersion is satisfied.

We use the pooled or within-


groups covariance matrix for
classification.
SW388R7

ASSUMPTION OF EQUAL DISPERSION FOR DEPENDENT VARIABLE GROUPS

Data Analysis &
Computers II

Slide 50

Had we rejected the null hypothesis and concluded that


dispersion was not equal across groups, we would have run
the analysis again, specifying separate-groups covariance
matrices for classification.

If classification using separate covariance matrices were


more accurate by 2% or more, we would report classification
accuracy based on this model rather than the one that uses
within-groups covariance.
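
The decision about which covariance matrix to use for classification can be sketched as below. The function name is hypothetical, and reading "2% or more" as two percentage points of accuracy is an assumption. The Box's M probability (0.096) and the 0.01 level of significance come from the slides; the accuracy figures in the second call are made up for illustration.

```python
def choose_covariance(box_m_p, acc_within, acc_separate=None,
                      alpha=0.01, min_gain=2.0):
    """Pick the covariance matrix for classification.

    Accuracies are percentages. min_gain is read as percentage points,
    which is an assumption about what "2% or more" means.
    """
    if box_m_p > alpha:
        # Equal dispersion is not rejected: keep the pooled matrix.
        return "within-groups"
    if acc_separate is not None and acc_separate - acc_within >= min_gain:
        return "separate-groups"
    return "within-groups"


# Sample problem: Box's M probability of 0.096 is above 0.01.
choice = choose_covariance(box_m_p=0.096, acc_within=50.0)
print(choice)

# Hypothetical case where the assumption fails and separate matrices help.
print(choose_covariance(box_m_p=0.004, acc_within=50.0, acc_separate=53.1))
```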
SW388R7

NUMBER OF DISCRIMINANT FUNCTIONS - 1


Data Analysis &
Computers II

Slide 51

The maximum possible number of discriminant


functions is the smaller of one less than the
number of groups defined by the dependent
variable and the number of independent
variables.

In this analysis there were 3 groups defined by


opinion about spending on welfare and 4
independent variables, so the maximum
possible number of discriminant functions was
2.
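
As a one-line check of this rule (the function name is hypothetical):

```python
def max_discriminant_functions(n_groups, n_predictors):
    """Smaller of (groups - 1) and the number of independent variables."""
    return min(n_groups - 1, n_predictors)


max_functions = max_discriminant_functions(n_groups=3, n_predictors=4)
print(max_functions)
```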
SW388R7

NUMBER OF DISCRIMINANT FUNCTIONS - 2


Data Analysis &
Computers II

Slide 52

In the table of Wilks' Lambda which tested functions for


statistical significance, the stepwise analysis identified 2
discriminant functions that were statistically significant. The
Wilks' lambda statistic for the test of function 1 through 2
functions (chi-square=21.853) had a probability of 0.001 which
was less than or equal to the level of significance of 0.05.

After removing function 1, the Wilks' lambda statistic for the


test of function 2 (chi-square=7.074) had a probability of
0.029 which was less than or equal to the level of
significance of 0.05. The significance of the maximum
possible number of discriminant functions supports the
interpretation of a solution using 2 discriminant functions.
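
The reported probabilities can be reproduced from the chi-square statistics. The slides do not state the degrees of freedom; the sketch below assumes 6 for the test of functions 1 through 2 and 2 for the test of function 2, which follows the usual formulas p(g - 1) and (p - 1)(g - 2) with p = 3 entered predictors and g = 3 groups. The helper name is hypothetical, and the closed form used only covers even degrees of freedom.

```python
import math


def chi2_sf_even_df(x, df):
    """Upper-tail chi-square probability; closed form valid for even df only."""
    if df % 2 != 0:
        raise ValueError("this closed form only covers even degrees of freedom")
    half = x / 2.0
    return math.exp(-half) * sum(half ** k / math.factorial(k)
                                 for k in range(df // 2))


# Chi-square values from the Wilks' Lambda table; the df values are assumed.
p_functions_1_to_2 = chi2_sf_even_df(21.853, 6)
p_function_2 = chi2_sf_even_df(7.074, 2)

for label, p in [("functions 1 through 2", p_functions_1_to_2),
                 ("function 2", p_function_2)]:
    print(label, round(p, 3), "significant" if p <= 0.05 else "not significant")
```

Under these assumed degrees of freedom the results round to 0.001 and 0.029, in agreement with the probabilities quoted above.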
SW388R7

MULTICOLLINEARITY
Data Analysis &
Computers II

Slide 53

Multicollinearity occurs when one


independent variable is so
strongly correlated with one or
more other variables that its
relationship to the dependent
variable is likely to be
misinterpreted. Its potential
unique contribution to explaining
the dependent variable is
minimized by its strong
relationship to other independent
variables. Multicollinearity is
indicated when the tolerance
value for an independent variable
is less than 0.10.

The tolerance values for all of the


independent variables are larger
than 0.10. Multicollinearity is not
a problem in this discriminant
analysis.
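
Tolerance is one minus the R² obtained when a predictor is regressed on the other predictors. The slides do not list the tolerance values for this analysis, so the R² figures below are hypothetical and serve only to show how the 0.10 cutoff is applied; the function and variable names are hypothetical as well.

```python
def tolerance(r_squared_with_other_predictors):
    """Tolerance = 1 - R-squared of a predictor regressed on the other predictors."""
    return 1.0 - r_squared_with_other_predictors


def flag_multicollinear(tolerances, cutoff=0.10):
    """Return the predictors whose tolerance falls below the cutoff."""
    return [name for name, tol in tolerances.items() if tol < cutoff]


# Hypothetical R-squared values, not taken from the sample problem.
hypothetical_r2 = {"x1": 0.35, "x2": 0.93, "x3": 0.12}
tolerances = {name: tolerance(r2) for name, r2 in hypothetical_r2.items()}
flagged = flag_multicollinear(tolerances)
print(flagged)
```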
SW388R7

Independent variables and group membership: relationship of functions to groups

Data Analysis &
Computers II

Slide 54

In order to specify the role that each independent


variable plays in predicting group membership on the
dependent variable, we must link together the
relationship between the discriminant functions and the
groups defined by the dependent variable, the role of
the significant independent variables in the
discriminant functions, and the differences in group
means for each of the variables.

Functions at Group Centroids

               Function
WELFARE        1        2
1          -.220     .235
2           .446    -.031
3          -.311    -.362
Unstandardized canonical discriminant
functions evaluated at group means

Function 1 separates survey respondents who thought we spend about the
right amount of money on welfare (the positive value of 0.446) from
survey respondents who thought we spend too much (negative value of
-0.311) or little money (negative value of -0.220) on welfare.

Function 2 separates survey respondents who thought we spend too little
money on welfare (positive value of 0.235) from survey respondents who
thought we spend too much money (negative value of -0.362) on welfare.
We ignore the second group (-0.031) in this comparison because it was
distinguished from the other two groups by function 1.
SW388R7

Independent variables and group membership: which predictors to interpret

Data Analysis &
Computers II

Slide 55

Variables Entered/Removed (a, b, c, d)

                                          Min. D Squared
                                                     Between    Exact F
Step  Entered                             Statistic  Groups     Statistic  df1      df2  Sig.
1     NUMBER OF HOURS WORKED LAST WEEK         .023  1 and 3         .475    1  135.000  .492
2     R SELF-EMP OR WORKS FOR SOMEBODY         .251  1 and 2        3.289    2  134.000  .040
3     HIGHEST YEAR OF SCHOOL COMPLETED         .364  1 and 3        2.433    3  133.000  .068
At each step, the variable that maximizes the Mahalanobis distance between the two closest
groups is entered.
a. Maximum number of steps is 8.
b. Maximum significance of F to enter is .05.
c.

When we use the stepwise method of variable inclusion, we limit our
interpretation of independent variable predictors to those listed as
statistically significant in the table of Variables Entered/Removed.

We will interpret the impact on membership in groups defined by the
dependent variable by the independent variables:
• number of hours worked in the past week
• self-employment
• highest year of school completed

Had we used simultaneous entry of all variables, we would not have
imposed this limitation.
SW388R7

Independent variables and group membership: predictor loadings on functions

Data Analysis &
Computers II

Slide 56

Structure Matrix

                                        Function
                                        1        2
HIGHEST YEAR OF SCHOOL COMPLETED     .687*    .136
NUMBER OF HOURS WORKED LAST WEEK    -.582*    .345
R SELF-EMP OR WORKS FOR SOMEBODY     .223     .889*
RESPONDENTS INCOME (a)               .101     .292*
Pooled within-groups correlations between discriminating
variables and standardized canonical discriminant functions
Variables ordered by absolute size of correlation within function.
*. Largest absolute correlation between each variable and
any discriminant function
a. This variable not used in the analysis.

Based on the structure matrix, the predictor variables strongly
associated with discriminant function 1 which distinguished between
survey respondents who thought we spend about the right amount of money
on welfare and survey respondents who thought we spend too much or
little money on welfare were number of hours worked in the past week
(r=-0.582) and highest year of school completed (r=0.687).

Based on the structure matrix, the predictor variable strongly
associated with discriminant function 2 which distinguished between
survey respondents who thought we spend too little money on welfare and
survey respondents who thought we spend too much money on welfare was
self-employment (r=0.889).
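
The asterisks in the structure matrix mark, for each variable, the function on which its absolute loading is largest. A small sketch of that assignment, using the loadings reported above for the three variables that entered the model (income is omitted because it was not used in the analysis); the function name is hypothetical:

```python
# (loading on function 1, loading on function 2) from the Structure Matrix.
structure_matrix = {
    "educ": (0.687, 0.136),
    "hrs1": (-0.582, 0.345),
    "wrkslf": (0.223, 0.889),
}


def dominant_function(loadings):
    """Return the 1-based number of the function with the largest absolute loading."""
    best_index = max(range(len(loadings)), key=lambda i: abs(loadings[i]))
    return best_index + 1


assignment = {name: dominant_function(loadings)
              for name, loadings in structure_matrix.items()}
for name, function_number in assignment.items():
    print(name, "-> function", function_number)
```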
SW388R7

Independent variables and group membership: predictors associated with first function - 1

Data Analysis &
Computers II

Slide 57

Group Statistics

                                                                         Valid N (listwise)
WELFARE        Variable                            Mean  Std. Deviation  Unweighted  Weighted
1 TOO LITTLE   NUMBER OF HOURS WORKED LAST WEEK   43.96          13.240          56    56.000
               HIGHEST YEAR OF SCHOOL COMPLETED   13.73           2.401          56    56.000
               R SELF-EMP OR WORKS FOR SOMEBODY    1.93            .260          56    56.000
               RESPONDENTS INCOME                 13.70           5.034          56    56.000
2 ABOUT RIGHT  NUMBER OF HOURS WORKED LAST WEEK   37.90          13.235          50    50.000
               HIGHEST YEAR OF SCHOOL COMPLETED   14.78           2.558          50    50.000
               R SELF-EMP OR WORKS FOR SOMEBODY    1.90            .303          50    50.000
               RESPONDENTS INCOME                 14.00           5.503          50    50.000
3 TOO MUCH     NUMBER OF HOURS WORKED LAST WEEK   42.03          10.456          32    32.000
               HIGHEST YEAR OF SCHOOL COMPLETED   13.38           2.524          32    32.000
               R SELF-EMP OR WORKS FOR SOMEBODY    1.75            .440          32    32.000
               RESPONDENTS INCOME                 14.75           5.304          32    32.000
Total          NUMBER OF HOURS WORKED LAST WEEK   41.32          12.846         138   138.000

The average number of hours worked in the past week for survey
respondents who thought we spend about the right amount of money on
welfare (mean=37.90) was lower than the average number of hours worked
in the past week for survey respondents who thought we spend too little
money on welfare (mean=43.96) and survey respondents who thought we
spend too much money on welfare (mean=42.03).

This supports the relationship that "survey respondents who thought we
spend about the right amount of money on welfare worked fewer hours in
the past week than survey respondents who thought we spend too little or
much money on welfare."
SW388R7

Independent variables and group membership: predictors associated with first function - 2

Data Analysis &
Computers II

Slide 58

Group Statistics

                                                                         Valid N (listwise)
WELFARE        Variable                            Mean  Std. Deviation  Unweighted  Weighted
1 TOO LITTLE   NUMBER OF HOURS WORKED LAST WEEK   43.96          13.240          56    56.000
               HIGHEST YEAR OF SCHOOL COMPLETED   13.73           2.401          56    56.000
               R SELF-EMP OR WORKS FOR SOMEBODY    1.93            .260          56    56.000
               RESPONDENTS INCOME                 13.70           5.034          56    56.000
2 ABOUT RIGHT  NUMBER OF HOURS WORKED LAST WEEK   37.90          13.235          50    50.000
               HIGHEST YEAR OF SCHOOL COMPLETED   14.78           2.558          50    50.000
               R SELF-EMP OR WORKS FOR SOMEBODY    1.90            .303          50    50.000
               RESPONDENTS INCOME                 14.00           5.503          50    50.000
3 TOO MUCH     NUMBER OF HOURS WORKED LAST WEEK   42.03          10.456          32    32.000
               HIGHEST YEAR OF SCHOOL COMPLETED   13.38           2.524          32    32.000
               R SELF-EMP OR WORKS FOR SOMEBODY    1.75            .440          32    32.000
               RESPONDENTS INCOME                 14.75           5.304          32    32.000
Total          NUMBER OF HOURS WORKED LAST WEEK   41.32          12.846         138   138.000

The average highest year of school completed for survey respondents who
thought we spend about the right amount of money on welfare
(mean=14.78) was higher than the average highest year of school
completed for survey respondents who thought we spend too little money
on welfare (mean=13.73) and survey respondents who thought we spend too
much money on welfare (mean=13.38).

This supports the relationship that "survey respondents who thought we
spend about the right amount of money on welfare had completed more
years of school than survey respondents who thought we spend too little
or much money on welfare."
SW388R7

Independent variables and group membership: predictors associated with second function

Data Analysis &
Computers II

Slide 59

Group Statistics

                                                                         Valid N (listwise)
WELFARE        Variable                            Mean  Std. Deviation  Unweighted  Weighted
1 TOO LITTLE   NUMBER OF HOURS WORKED LAST WEEK   43.96          13.240          56    56.000
               HIGHEST YEAR OF SCHOOL COMPLETED   13.73           2.401          56    56.000
               R SELF-EMP OR WORKS FOR SOMEBODY    1.93            .260          56    56.000
               RESPONDENTS INCOME                 13.70           5.034          56    56.000
2 ABOUT RIGHT  NUMBER OF HOURS WORKED LAST WEEK   37.90          13.235          50    50.000
               HIGHEST YEAR OF SCHOOL COMPLETED   14.78           2.558          50    50.000
               R SELF-EMP OR WORKS FOR SOMEBODY    1.90            .303          50    50.000
               RESPONDENTS INCOME                 14.00           5.503          50    50.000
3 TOO MUCH     NUMBER OF HOURS WORKED LAST WEEK   42.03          10.456          32    32.000
               HIGHEST YEAR OF SCHOOL COMPLETED   13.38           2.524          32    32.000
               R SELF-EMP OR WORKS FOR SOMEBODY    1.75            .440          32    32.000
               RESPONDENTS INCOME                 14.75           5.304          32    32.000
Total          NUMBER OF HOURS WORKED LAST WEEK   41.32          12.846         138   138.000

Since self-employment is a dichotomous variable, the mean is not
directly interpretable. Its interpretation must take into account the
coding by which 1 corresponds to self-employed and 2 corresponds to
working for someone else. The lower mean for survey respondents who
thought we spend too much money on welfare (mean=1.75), when compared to
the mean for survey respondents who thought we spend too little money on
welfare (mean=1.93), implies that the group contained more survey
respondents who were self-employed and fewer survey respondents who were
working for someone else.

This supports the relationship that "survey respondents who thought we
spend too much money on welfare were more likely to be self-employed
than survey respondents who thought we spend too little money on
welfare."
SW388R7

CLASSIFICATION USING THE DISCRIMINANT MODEL: by chance accuracy rate

Data Analysis &
Computers II

Slide 60

The independent variables could be characterized as useful
predictors of membership in the groups defined by the dependent
variable if the cross-validated classification accuracy rate was
significantly higher than the accuracy attainable by chance alone.
Operationally, the cross-validated classification accuracy rate should
be at least 25% higher than the proportional by chance accuracy
rate.

The proportional by chance accuracy rate was computed by
squaring and summing the proportion of cases in each group from
the table of prior probabilities for groups (0.406² + 0.362² + 0.232²
= 0.350).

Prior Probabilities for Groups

Cases Used in Analysis


WELFARE Prior Unweighted Weighted
1 TOO LITTLE .406 56 56.000
2 ABOUT RIGHT .362 50 50.000
3 TOO MUCH .232 32 32.000
Total 1.000 138 138.000
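
The by chance computation can be reproduced from the group counts in the table above. This is a sketch with hypothetical names; the 25% margin is applied as a multiplier of 1.25 on the chance rate, and 50.0% is the cross-validated accuracy reported by SPSS.

```python
def proportional_by_chance(group_sizes):
    """Sum of squared group proportions."""
    total = sum(group_sizes)
    return sum((n / total) ** 2 for n in group_sizes)


group_sizes = [56, 50, 32]      # cases used in the analysis, by group
chance_rate = proportional_by_chance(group_sizes)
criterion = 1.25 * chance_rate  # chance rate plus the 25% margin
cross_validated = 0.500         # cross-validated accuracy from the SPSS output
meets_criterion = cross_validated >= criterion

print(round(chance_rate, 3), round(criterion, 3), meets_criterion)
```

The chance rate rounds to 0.350 and the criterion to 0.437, so the cross-validated accuracy of 50.0% clears the threshold.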
SW388R7

CLASSIFICATION USING THE DISCRIMINANT MODEL: criteria for classification accuracy

Data Analysis &
Computers II

Slide 61

Classification Results (b, c)

                                           Predicted Group Membership
                          WELFARE            1 TOO LITTLE  2 ABOUT RIGHT  3 TOO MUCH  Total
Original           Count  1 TOO LITTLE                 43             15           6     64
                          2 ABOUT RIGHT                26             30           6     62
                          3 TOO MUCH                   17             10           9     36
                          Ungrouped cases               3              3           2      8
                   %      1 TOO LITTLE               67.2           23.4         9.4  100.0
                          2 ABOUT RIGHT              41.9           48.4         9.7  100.0
                          3 TOO MUCH                 47.2           27.8        25.0  100.0
                          Ungrouped cases            37.5           37.5        25.0  100.0
Cross-validated a  Count  1 TOO LITTLE                 43             15           6     64
                          2 ABOUT RIGHT                26             30           6     62
                          3 TOO MUCH                   17             11           8     36
                   %      1 TOO LITTLE               67.2           23.4         9.4  100.0
                          2 ABOUT RIGHT              41.9           48.4         9.7  100.0
                          3 TOO MUCH                 47.2           30.6        22.2  100.0
a. Cross validation is done only for those cases in the analysis. In cross validation, each case is
classified by the functions derived from all cases other than that case.
b. 50.6% of original grouped cases correctly classified.
c. 50.0% of cross-validated grouped cases correctly classified.

The cross-validated accuracy rate computed by SPSS was 50.0%, which was
greater than or equal to the proportional by chance accuracy criterion
of 43.7% (1.25 x 35.0% = 43.7%). The criterion for classification
accuracy is satisfied.
SW388R7

CLASSIFICATION USING THE DISCRIMINANT MODEL: VALIDATION OF THE DISCRIMINANT ANALYSIS

Data Analysis &
Computers II

Slide 62

Classification Results (b, c)

                                           Predicted Group Membership
                          WELFARE            1 TOO LITTLE  2 ABOUT RIGHT  3 TOO MUCH  Total
Original           Count  1 TOO LITTLE                 43             15           6     64
                          2 ABOUT RIGHT                26             30           6     62
                          3 TOO MUCH                   17             10           9     36
                          Ungrouped cases               3              3           2      8
                   %      1 TOO LITTLE               67.2           23.4         9.4  100.0
                          2 ABOUT RIGHT              41.9           48.4         9.7  100.0
                          3 TOO MUCH                 47.2           27.8        25.0  100.0
                          Ungrouped cases            37.5           37.5        25.0  100.0
Cross-validated a  Count  1 TOO LITTLE                 43             15           6     64
                          2 ABOUT RIGHT                26             30           6     62
                          3 TOO MUCH                   17             11           8     36
                   %      1 TOO LITTLE               67.2           23.4         9.4  100.0
                          2 ABOUT RIGHT              41.9           48.4         9.7  100.0
                          3 TOO MUCH                 47.2           30.6        22.2  100.0
a. Cross validation is done only for those cases in the analysis. In cross validation, each case is
classified by the functions derived from all cases other than that case.
b. 50.6% of original grouped cases correctly classified.
c. 50.0% of cross-validated grouped cases correctly classified.

The cross-validated accuracy rate is a measure of the generalizability
of the discriminant analysis for correctly classifying populations not
included in the original model. Since the cross-validated classification
accuracy rate (50.0%) met or exceeded the proportional by chance
accuracy criterion (43.7%), this requirement for generalizability was
satisfied.
SW388R7

Answering the problem question - 1


Data Analysis &
Computers II

Slide 63

From the list of variables "number of hours worked in the past week" [hrs1], "self-employment"
[wrkslf], "highest year of school completed" [educ], and "income" [rincom98],
the most useful predictors for distinguishing among groups based on responses to "opinion
about spending on welfare" [natfare] are "number of hours worked in the past week"
[hrs1], "self-employment" [wrkslf], and "highest year of school completed" [educ]. These
predictors differentiate survey respondents who thought we spend too much money on welfare
from survey respondents who thought we spend about the right amount of money on welfare
who, in turn, are differentiated from survey respondents who thought we spend too little
money on welfare.

The stepwise discriminant analysis included the three variables identified
as the most useful predictors.

The most important predictor of groups based on responses to opinion about spending on
welfare was number of hours worked in the past week. The second most important predictor of
groups based on responses to opinion about spending on welfare was self-employment. The
third most important predictor of groups based on responses to opinion about spending on
welfare was highest year of school completed.

Survey respondents who thought we spend about the right amount of money on welfare worked
fewer hours in the past week than survey respondents who thought we spend too much or little
money on welfare. Survey respondents who thought we spend about the right amount of money
on welfare had completed more years of school than survey respondents who thought we
spend too much or little money on welfare. Survey respondents who thought we spend too
much money on welfare were more likely to be self-employed than survey respondents who
thought we spend too little money on welfare.
SW388R7

Answering the problem question - 2


Data Analysis &
Computers II

Slide 64

From the list of variables "number of hours worked in the past week" [hrs1], "self-employment"
[wrkslf], "highest year of school completed" [educ], and "income" [rincom98], the most useful
predictors for distinguishing among groups based on responses to "opinion about spending on
welfare" [natfare] are "number of hours worked in the past week" [hrs1], "self-employment"
[wrkslf], and "highest year of school completed" [educ]. These predictors differentiate survey
respondents who thought we spend too much money on welfare from survey respondents
who thought we spend about the right amount of money on welfare who, in turn, are
differentiated from survey respondents who thought we spend too little money on welfare.

The most important predictor of groups based on responses to opinion about spending on
welfare was number of hours worked in the past week. The second most important predictor of
groups based on responses to opinion about spending on welfare was self-employment. The
third most important predictor of groups based on responses to opinion about spending on
welfare was highest year of school completed.

We found two statistically significant discriminant functions, making it possible to
distinguish among the three groups defined by the dependent variable.
Moreover, the cross-validated classification accuracy surpassed the by chance accuracy
criteria, supporting the utility of the model.

Survey respondents who thought we spend about the right amount of money on welfare worked
fewer hours in the past week than survey respondents who thought we spend too much or little
money on welfare. Survey respondents who thought we spend about the right amount of money
on welfare had completed more years of school than survey respondents who thought we
spend too much or little money on welfare. Survey respondents who thought we spend too
much money on welfare were more likely to be self-employed than survey respondents who
thought we spend too little money on welfare.
SW388R7

Answering the problem question - 3


Data Analysis &
Computers II

Slide 65

From the list of variables "number of hours worked in the past week" [hrs1], "self-employment"
[wrkslf], "highest year of school completed" [educ], and "income" [rincom98], the most useful
predictors for distinguishing among groups based on responses to "opinion about spending on
welfare" [natfare] are "number of hours worked in the past week" [hrs1], "self-employment"
[wrkslf], and "highest year of school completed" [educ]. These predictors differentiate survey
respondents who thought we spend too much money on welfare from survey respondents who
thought we spend about the right amount of money on welfare who, in turn, are differentiated
from survey respondents who thought we spend too little money on welfare.

The order of importance matched the order of entry in the table of
"Variables Entered/Removed."

The most important predictor of groups based on responses to opinion about spending on
welfare was number of hours worked in the past week. The second most important
predictor of groups based on responses to opinion about spending on welfare was self-
employment. The third most important predictor of groups based on responses to opinion
about spending on welfare was highest year of school completed.

Survey respondents who thought we spend about the right amount of money on welfare worked
fewer hours in the past week than survey respondents who thought we spend too much or little
money on welfare. Survey respondents who thought we spend about the right amount of money
on welfare had completed more years of school than survey respondents who thought we
spend too much or little money on welfare. Survey respondents who thought we spend too
much money on welfare were more likely to be self-employed than survey respondents who
thought we spend too little money on welfare.
SW388R7

Answering the problem question - 4


Data Analysis &
Computers II

Slide 66

The most important predictor of groups based on responses to opinion about spending on
welfare was number of hours worked in the past week. The second most important predictor of
groups based on responses to opinion about spending on welfare was self-employment. The
third most important predictor of groups based on responses to opinion about spending on
welfare was highest year of school completed.

We verified that each statement about the relationship between
predictors and groups was correct.

Survey respondents who thought we spend about the right amount of money on welfare
worked fewer hours in the past week than survey respondents who thought we spend too
much or little money on welfare. Survey respondents who thought we spend about the right
amount of money on welfare had completed more years of school than survey respondents
who thought we spend too much or little money on welfare. Survey respondents who
thought we spend too much money on welfare were more likely to be self-employed than
survey respondents who thought we spend too little money on welfare.

1. True
2. True with caution
3. False
4. Inappropriate application of a statistic

The answer to the question is true with caution. A caution is added because of
the inclusion of ordinal level variables. A caution is added because of a violation
of discriminant analysis assumptions.
SW388R7

Complete discriminant analysis: level of measurement

Data Analysis &
Computers II

Slide 67

The following is a guide to the decision process for answering
problems about the complete discriminant analysis:

Dependent non-metric?
Independent variables metric or dichotomous?
    No: Inappropriate application of a statistic.
    Yes: go on to the next check.
SW388R7

Complete discriminant analysis: analyzing missing data

Data Analysis &
Computers II

Slide 68

Is any variable missing data for more than 5% of the cases in the data set?
    No: go on to the baseline analysis.
    Yes: Create missing/valid group variable to use in t-tests with other metric
         variables in the analysis and chi-square tests with other nonmetric
         variables in the analysis.
        Probability of t-tests or chi-square tests <= level of significance?
            No: go on to the baseline analysis.
            Yes: Add caution to interpretation to require further work to
                 understand pattern.

Run baseline discriminant analysis, using method for
including variables identified in the research question.
Record cross-validated classification accuracy for
evaluation of transformations and removal of outliers.
SW388R7

Complete discriminant analysis: assumption of normality

Data Analysis &
Computers II

Slide 69

Metric IV's satisfy criteria for a normal distribution?
    Yes: go on to the next check.
    No: Log, square root, or inverse transformation satisfies normality?
        Yes: Use transformation in revised model. If more than one
             transformation satisfies normality, use one with smallest skew.
        No: Add caution for violation of normality.

If any variables were transformed for
normality or linearity, substitute
transformed variables in the detection
of outliers.
Complete discriminant analysis: detecting outliers

Slide 70

Is the probability of Mahalanobis D² for any case less than or equal to 0.001?
  No: Go to the revised discriminant analysis.
  Yes: Exclude outliers from revised model.

Run revised discriminant analysis using transformed variables and omitting outliers.
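
As a sketch of the outlier rule, the code below converts each case's Mahalanobis D² into an upper-tail probability and flags cases at or below 0.001. It assumes the D² values have already been saved from the discriminant analysis, and that D² is referred to a chi-square distribution with degrees of freedom equal to the number of independent variables. That choice of degrees of freedom is an assumption of this example, and the function names are illustrative.

```python
import math
from typing import List


def chi_square_upper_tail(x: float, df: int) -> float:
    """P(X >= x) for a chi-square variable with integer df (closed-form series)."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum_{i=0}^{df/2 - 1} (x/2)^i / i!
        term, total = 1.0, 1.0
        for i in range(1, df // 2):
            term *= half / i
            total += term
        return math.exp(-half) * total
    # Odd df: erfc(sqrt(x/2)) + exp(-x/2) * sum_{i=1}^{(df-1)/2} (x/2)^(i-1/2) / Gamma(i+1/2)
    total = 0.0
    for i in range(1, (df - 1) // 2 + 1):
        total += half ** (i - 0.5) / math.gamma(i + 0.5)
    return math.erfc(math.sqrt(half)) + math.exp(-half) * total


def flag_outliers(d_squared: List[float], n_independent: int,
                  cutoff: float = 0.001) -> List[int]:
    """Indexes of cases whose D² probability is less than or equal to the cutoff."""
    return [
        i for i, d2 in enumerate(d_squared)
        if chi_square_upper_tail(d2, n_independent) <= cutoff
    ]
```

The flagged indexes are the cases that would be omitted from the revised model.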
Complete discriminant analysis: picking discriminant model for interpretation

Slide 71

Cross-validated accuracy for revised discriminant analysis > accuracy of baseline by 2% or more?
  Yes: Pick discriminant analysis with transformations and omitting outliers for interpretation.
  No: Pick baseline discriminant analysis for interpretation.
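
The comparison can be written as a one-line rule. In this sketch the accuracies are cross-validated percentages and "2%" is read as two percentage points, which is an assumption about the slide's intent. The function name is illustrative.

```python
def pick_model(baseline_accuracy: float, revised_accuracy: float,
               min_gain: float = 2.0) -> str:
    """Choose between baseline and revised models.

    Both accuracies are cross-validated percentages, e.g. 68.5 for 68.5%.
    """
    # Round to avoid floating-point noise when the gain is exactly 2.0.
    gain = round(revised_accuracy - baseline_accuracy, 6)
    return "revised" if gain >= min_gain else "baseline"
```

For example, a baseline of 70.0% and a revised accuracy of 71.5% keeps the baseline model, since the gain does not justify the burden of explaining transformations.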
Complete discriminant analysis: sample size

Slide 72

Ratio of cases to independent variables at least 5 to 1?
  No: Inappropriate application of a statistic
  Yes: Go to the check on the smallest group.

Number of cases in smallest group greater than number of independent variables?
  No: Inappropriate application of a statistic
  Yes: Continue to the next step.
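
Both minimum sample size requirements can be checked together. This is a minimal sketch; the function name and the dictionary of group sizes are illustrative choices.

```python
from typing import Dict


def sample_size_ok(group_sizes: Dict[str, int], n_independent: int) -> bool:
    """True when both minimum sample size requirements on the slide are met."""
    total_cases = sum(group_sizes.values())
    # Requirement 1: ratio of cases to independent variables at least 5 to 1.
    ratio_ok = total_cases >= 5 * n_independent
    # Requirement 2: smallest group has MORE cases than there are independent variables.
    smallest_ok = min(group_sizes.values()) > n_independent
    return ratio_ok and smallest_ok
```

A result of `False` corresponds to "Inappropriate application of a statistic".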
Complete discriminant analysis: assumption of equal dispersion

Slide 73

Probability of Box's M test less than or equal to level of significance for assumptions?
  No: Pick discriminant analysis using within-groups covariance for interpretation.
  Yes: Re-run discriminant analysis, using separate-groups covariance matrices for classification.

Accuracy rate at least 2% higher using separate-groups covariance matrices?
  No: Pick discriminant analysis using within-groups covariance for interpretation.
  Yes: Pick discriminant analysis using separate-groups covariance for interpretation.
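
The two decisions on this slide chain together, as the sketch below shows. It assumes the Box's M probability and both accuracy rates (as percentages) have already been obtained from the statistics package, and it reads "2% higher" as two percentage points. The function and parameter names are illustrative.

```python
def pick_covariance(box_m_probability: float, alpha: float,
                    within_accuracy: float, separate_accuracy: float,
                    min_gain: float = 2.0) -> str:
    """Decide which covariance matrices to use for classification."""
    if box_m_probability > alpha:
        # Equal dispersion is not rejected: keep the pooled within-groups matrix.
        return "within-groups"
    # Equal dispersion is violated: switch only if accuracy improves enough.
    gain = round(separate_accuracy - within_accuracy, 6)
    return "separate-groups" if gain >= min_gain else "within-groups"
```

Note that `separate_accuracy` is only meaningful after the analysis has been re-run with separate-groups covariance matrices.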
Complete discriminant analysis: usable discriminant model

Slide 74

Sufficient statistically significant functions to distinguish DV groups?
  No: False
  Yes: Go to the check on tolerance.

Tolerance for all IV's greater than 0.10, indicating no multicollinearity?
  No: False
  Yes: Continue to the next step.
Complete discriminant analysis: relationships between IV's and DV

Slide 75

Stepwise method of entry used to include independent variables?
  No: Go to the relationships between individual IVs and DV groups.
  Yes: Entry order of variables interpreted correctly?
    No: False
    Yes: Go to the relationships between individual IVs and DV groups.

Relationships between individual IVs and DV groups interpreted correctly?
  No: False
  Yes: Continue to the next step.
Complete discriminant analysis: classification accuracy

Slide 76

Cross-validated accuracy is 25% higher than proportional by chance accuracy rate?
  No: False
  Yes: Continue to the next step.
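
The accuracy criterion can be computed from the group sizes. The proportional by chance accuracy rate is the sum of the squared proportions of cases in each group, and the criterion is 1.25 times that rate. In this sketch accuracy is a proportion between 0 and 1, "25% higher" is read as "at least 25% higher", and the function names are illustrative.

```python
from typing import List


def proportional_by_chance(group_sizes: List[int]) -> float:
    """Sum of squared group proportions."""
    total = sum(group_sizes)
    return sum((n / total) ** 2 for n in group_sizes)


def accuracy_criterion(group_sizes: List[int], margin: float = 0.25) -> float:
    """By chance accuracy rate raised by the required margin (25% by default)."""
    return (1 + margin) * proportional_by_chance(group_sizes)


def classification_is_useful(cross_validated_accuracy: float,
                             group_sizes: List[int]) -> bool:
    """True when cross-validated accuracy (a proportion) meets the criterion."""
    return cross_validated_accuracy >= accuracy_criterion(group_sizes)
```

With groups of 50, 30, and 20 cases, the by chance rate is 0.25 + 0.09 + 0.04 = 0.38, so the criterion is 1.25 × 0.38 = 0.475.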
Complete discriminant analysis: adding cautions to solution

Slide 77

Satisfies preferred ratio of cases to IV's of 20 to 1?
  No: True with caution
  Yes: Go to the next check.

Satisfies preferred DV group minimum size of 20 cases?
  No: True with caution
  Yes: Go to the next check.

DV is non-metric level and IVs are interval level or dichotomous (not ordinal)?
  No: True with caution
  Yes: True
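
The final step can be sketched as collecting cautions and choosing between "True" and "True with caution". This assumes all earlier checks have already passed. The function names, parameter names, and caution wording are illustrative.

```python
from typing import List


def solution_cautions(n_cases: int, n_independent: int,
                      smallest_group: int, has_ordinal_iv: bool) -> List[str]:
    """List the cautions that apply to an otherwise correct solution."""
    cautions = []
    if n_cases < 20 * n_independent:
        cautions.append("ratio of cases to IVs is below the preferred 20 to 1")
    if smallest_group < 20:
        cautions.append("smallest DV group has fewer than the preferred 20 cases")
    if has_ordinal_iv:
        cautions.append("ordinal level variables are included as IVs")
    return cautions


def final_answer(cautions: List[str]) -> str:
    """'True' when no caution applies, otherwise 'True with caution'."""
    return "True with caution" if cautions else "True"
```

For example, 150 cases with 5 independent variables, a smallest group of 18 cases, and no ordinal variables yields one caution and the answer "True with caution".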
