You are on page 1of 49

# A Problem from the Research Literature

In this exercise, we will work an example of logistic regression as found in the literature:
Sandra L. Hanson and Douglas M. Sloane, "Young Children and Job Satisfaction." Journal
of Marriage and the Family, 54 (November, 1992), 799-811.
The data for this problem is: YoungChildrenJobSatisfaction.Sav.

Slide 1

## Stage One: Define the Research Problem

In this stage, the following issues are addressed:
Relationship to be analyzed
Specifying the dependent and independent variables
Method for including independent variables

Relationship to be analyzed
"We are interested in examining the effect of young children on the job satisfaction of
men and women involved in a variety of work and family roles to see how the presence
of family responsibilities affects their happiness at work. The research is comparative. It
involves contrasts between men and women in different work and marital statuses as
several points in time." (page 800)

Slide 2

## Specifying the dependent and independent variables

The dependent variable is job satisfaction, measured on a four category Likert-scale:
1=Very Satisfied, 2=Moderately Satisfied, 3=A Little Dissatisfied, and 4=Very Dissatisfied.
Because the data does not follow a normal distribution (See page 803-804), the authors
recoded the variable to a dichotomous variable where 1 = Very Satisfied and 0 =
Moderately Satisfied to Very Dissatisfied. The purpose of the analysis, then, is to
determine what factors contribute to a high level of job satisfaction versus some other
level of job satisfaction. With a dichotomous dependent variable, logistic regression
becomes the analytic techniques of choice.
The independent variables are grouped into two categories:
1.
Individual and family characteristics (age, race, education, spouse's work status,
prestige of spouse's occupation, number of children, presence of young children, general
happiness, and satisfaction with family)
2.
Job characteristics (income, job prestige, job authority, job autonomy,
convenience (number of hours worked per week), and past work experience).
The variable presence of young children is important to answering the main question of
the article.
Other variables, which could have been included as independent variables, were used to
divide the sample into subgroups which were compared with each other to answer the
research questions. For example, Sex and Work Status were combined to form a
composite variable WORK_SEX. We will use these variables with the SPSS "Select Cases
command to produce the results for different groups.

Slide 3

## Method for including independent variables

With a dichotomous dependent variable and a variety of independent variables, the
statistical technique to use is logistic regression. While we could structure the analysis
to do hierarchical entry of variables (individual, family characteristics, and job
characteristics in block 1 and the presence of young children in block 2), we will use
direct entry of all variables on a single step to conform to the authors analysis.

Slide 4

## Stage 2: Develop the Analysis Plan: Sample Size Issues

In this stage, the following issues are addressed:
Missing data analysis
Minimum sample size requirement: 15-20 cases per independent variable

Slide 5

## Missing data analysis

In the missing data analysis, we are looking for a pattern or process whereby the pattern
of missing data could influence the results of the statistical analysis.
The data set for this problem is used for a large number of analyses in the article. Not
all variables and cases are used in each analysis, so it makes sense to conduct the
missing data analysis on the cases and variables to be included in the problem in this
exercise.
We will compute the logistic regression model for 1976-77 married, full-time males as
presented in table 2 on page 807. (Note: this analysis does not include the independent
variables SPOCCUP 'Spouses Occupation' and EVWORK 'Ever Work as Long as One Year').
First, we will exclude the cases not used in this exercise and then we will examine
missing data for the variables used in this exercise.

Slide 6

Slide 7

Slide 8

Slide 9

Slide 10

## Number of Valid and Missing Cases per Variable

Two independent variables have relatively large numbers of missing cases:
JCINCOME 'Job Characteristic - Income' and AUTHORIT 'Job Characteristic - Authority'.
However, all variables have valid data for 90% or more of cases, so no variables will be
excluded for an excessive number of missing cases.

Slide 11

## Frequency of Cases that are Missing Variables

Next, we examine the number of missing variables per case. Of the possible 14 variables
in the analysis (13 independent variables and 1 dependent variable), one cases was
missing half of the variables (7) and should be excluded from the remaining analyses.

Slide 12

## Distribution of Patterns of Missing Data

About 97.3% of the cases have no missing variables or one missing variable. Of those
cases missing two or more variables, the frequencies for the combinations are 4 or
fewer. There is no evidence of a predominant missing data pattern that will impact the
analysis.

Slide 13

## Correlation Matrix of Valid/Missing Dichotomous Variables

The largest correlation in the matrix of valid/missing data (not shown) is 0.363. None of
the correlations for missing data values are above the weak level, so we can delete
missing cases without fear that we are distorting the solution.

Slide 14

## Minimum sample size requirement:

15-20 cases per independent variable
If we accept the SPSS default of listwise deletion of missing data, we will have 538 cases
in the analysis. The ratio of cases to independent variables is 538/13 or 41 to 1. We
meet this requirement.

Slide 15

## Stage 2: Develop the Analysis Plan: Measurement Issues:

In this stage, the following issues are addressed:
Incorporating nonmetric data with dummy variables
Representing Curvilinear Effects with Polynomials
Representing Interaction or Moderator Effects

## Incorporating Nonmetric Data with Dummy Variables

All of the nonmetric variables have recoded into dichotomous dummy-coded variables.

## Representing Curvilinear Effects with Polynomials

We do not have any evidence of curvilinear effects at this point in the analysis.

## Representing Interaction or Moderator Effects

We do not have any evidence at this point in the analysis that we should add interaction
or moderator variables.
Young Children and Job Satisfaction

Slide 16

## Stage 3: Evaluate Underlying Assumptions

In this stage, the following issues are addressed:
Nonmetric dependent variable with two groups
Metric or dummy-coded independent variables

## Nonmetric dependent variable having two groups

The dependent variable 'Job satisfaction' was recoded into dichotomous categories.

## Metric or dummy-coded independent variables

Marital status, race, spouse's work status, presence of young children, job authority, job
autonomy, and ever worked as long as one year are all coded as dichotomous variables.
Age of respondent, highest year of school completed, prestige of spouse's occupation,
number or children, general happiness, satisfaction with family, income, job prestige,
hours worked (convenience), and year of the survey can be treated as metric variables.

Slide 17

## Stage 4: Estimation of Logistic Regression and

Assessing Overall Fit: Model Estimation
In this stage, the following issues are addressed:
Compute logistic regression model

## Compute the logistic regression

The steps to obtain a logistic regression analysis are detailed on the following screens.
If the cases to be included in this analysis were not selected in the missing data analysis,
the selection needs to be completed before proceeding.

Slide 18

Slide 19

Slide 20

Slide 21

Slide 22

Slide 23

Slide 24

Slide 25

## Stage 4: Estimation of Logistic Regression and

Assessing Overall Fit: Assessing Model Fit
In this stage, the following issues are addressed:
Significance test of the model log likelihood (Change in -2LL)
Measures Analogous to R: Cox and Snell R and Nagelkerke R
Hosmer-Lemeshow Goodness-of-fit
Classification matrices
Check for Numerical Problems
Presence of outliers

Slide 26

## Initial statistics before independent variables are included

The Initial Log Likelihood Function, (-2 Log Likelihood or -2LL) is a statistical measure
like total sums of squares in regression. If our independent variables have a relationship
to the dependent variable, we will improve our ability to predict the dependent variable
accurately, and the log likelihood value will decrease. The initial 2LL value is 742.850
on step 0, before any variables have been added to the model.

Slide 27

## Significance test of the model log likelihood

The difference between these two measures is the model child-square value (57.153 =
742.850 685.697) that is tested for statistical significance. This test is analogous to the
F-test for R or change in R value in multiple regression which tests whether or not the
improvement in the model associated with the additional variables is statistically
significant.

In this problem the model Chi-Square value of 57.153 has a significance of 0.000, less
than 0.05, so we conclude that there is a significant relationship between the dependent
variable and the set of independent variables.

## Young Children and Job Satisfaction

Slide 28

Measures Analogous to R
The next SPSS outputs indicate the strength of the relationship between the dependent
variable and the independent variables, analogous to the R measures in multiple
regression.

The Cox and Snell R measure operates like R, with higher values indicating greater
model fit. However, this measure is limited in that it cannot reach the maximum value
of 1, so Nagelkerke proposed a modification that had the range from 0 to 1. We will rely
upon Nagelkerke's measure as indicating the strength of the relationship.
Based on the interpretive criteria, we would characterize this model as weak.

Slide 29

## Correspondence of Actual and Predicted Values

of the Dependent Variable
The final measure of model fit is the Hosmer and Lemeshow goodness-of-fit statistic,
which measures the correspondence between the actual and predicted values of the
dependent variable. In this case, better model fit is indicated by a smaller difference in
the observed and predicted classification. A good model fit is indicated by a
nonsignificant chi-square value.

The goodness-of-fit measure has a value of 5.678 which has the desirable outcome of
nonsignificance.
Young Children and Job Satisfaction

Slide 30

## The Classification Matrices

The classification matrices in logistic regression serve the same function as the
classification matrices in Young Children and Job Satisfaction, i.e. evaluating the
accuracy of the model.

To evaluate the accuracy of the model, we compute the proportional by chance accuracy
rate and the maximum by chance accuracy rates, if appropriate. Since the sizes of the
groups in this problem are equal to 46% and 54%, the proportional accuracy criterion is
appropriate because we do not have a dominant group.
The proportional by chance accuracy rate is equal to 0.503 (0.463^2 + 0.537^2). A 25%
increase over the by chance accuracy rate would equal 0.628.
Our model accuracy race of 63.2% meets this criterion.

Slide 31

Stacked Histogram
SPSS provides a
visual image of
the classification
accuracy in the
stacked
histogram as
shown below.
To the extent to
which the cases
in one group
cluster on the
left and the
other group
clusters on the
right, the
predictive
accuracy of the
model will be
higher.

Slide 32

## Check for Numerical Problems

There are several numerical problems that can in logistic regression that are not
detected by SPSS or other statistical packages: multicollinearity among the independent
variables, zero cells for a dummy-coded independent variable because all of the
subjects have the same value for the variable, and "complete separation" whereby the
two groups in the dependent event variable can be perfectly separated by scores on one
of the independent variables.
All of these problems produce large standard errors (over 2) for the variables included in
the analysis and very often produce very large B coefficients as well. If we encounter
large standard errors for the predictor variables, we should examine frequency tables,
one-way ANOVAs, and correlations for the variables involved to try to identify the source
of the problem.
The standard
errors and B
coefficients are
not excessively
large, so there is
no evidence of a
numeric problem
with this analysis.

## Young Children and Job Satisfaction

Slide 33

Presence of outliers
There are two outputs to alert us to outliers that we might consider excluding from the
analysis: listing of residuals and saving Cook's distance scores to the data set.
SPSS provides a casewise list of residuals that identify cases whose residual is above or
below a certain number of standard deviation units. Like multiple regression there are a
variety of ways to compute the residual. In logistic regression, the residual is the
difference between the observed probability of the dependent variable event and the
predicted probability based on the model. The standardized residual is the residual
divided by an estimate of its standard deviation. The deviance is calculated by taking
the square root of -2 x the log of the predicted probability for the observed group and
attaching a negative sign if the event did not occur for that case. Large values for
deviance indicate that the model does not fit the case well. The studentized residual
for a case is the change in the model deviance if the case is excluded. Discrepancies
between the deviance and the studentized residual may identify unusual cases. (See the
SPSS chapter on Logistic Regression Analysis for additional details).
In the output for our problem, SPSS listed one cases that have may be considered an
outlier with a studentized residuals greater than 2:

## Young Children and Job Satisfaction

Slide 34

Cooks Distance
SPSS has an option to compute Cook's distance as a measure of influential cases and add
the score to the data editor. I am not aware of a precise formula for determining what
cutoff value should be used, so we will rely on the more traditional method for
interpreting Cook's distance which is to identify cases that either have a score of 1.0 or
higher, or cases which have a Cook's distance substantially different from the other. The
prescribed method for detecting unusually large Cook's distance scores is to create a
scatterplot of Cook's distance scores versus case id.

Slide 35

Slide 36

Slide 37

## The Scatterplot of Cook's Distances

Horizontal gridlines were added to the scatterplot to aid interpretation. Based on the
gridlines, we can identify four cases with Cook's distances about 0.175 as influential
cases.
After sorting the data set by the
Cook's distance variable, we
identify the four cases as having
id numbers: 99, 1807, 1833, and
1953. None of these cases were
included on the casewise listing
for large studentized residuals.
Based on these outputs, we
identify five cases out of 538 that
are potential outliers. Since the
number of outliers represents
less than 1% of the sample and
none of the outliers are really
extreme, I will opt to retain them
in the analysis.

Slide 38

## Stage 5: Interpret the Results

In this section, we address the following issues:
Identifying the statistically significant predictor variables
Direction of relationship and contribution to dependent variable

Slide 39

## Identifying the statistically significant predictor variables

The table of variables in the equation identifies for us the predictor variables that have
a statistically significant individual relationship to the dependent variable. Scanning the
'Sig' column, we identify four variables that have a significance level less than
0.05: GENHAPPY 'How Happy Generally', PRESTIGE 'Job Characteristic - Prestige',
CONVENIE 'Job Characteristic - Convenience', and YEAR 'GSS Year for Respondent'.

Slide 40

## Direction of relationship and contribution to dependent variable - 1

The sign of the B coefficients indicates whether the predictor variable increased or
decreased the likelihood of belonging to the group of respondents who were very
satisfied with their jobs.

Slide 41

## Direction of relationship and contribution to dependent variable - 2

The coefficient signs for the variables GENHAPPY 'How Happy Generally', PRESTIGE 'Job
Characteristic - Prestige', and CONVENIE 'Job Characteristic - Convenience' were all
positive, indicating that a higher score on these variables enhanced the likelihood of
belonging to the group that was very satisfied with their jobs. The coefficient for YEAR
was negative, indicating that job satisfaction has been declining in later years of the
survey.
The magnitude of change associated with each independent variable is given in the odds
ratio column labeled 'Exp (B)'. This column indicates the increased or decreased odds of
belonging to the group that was very satisfied with their jobs.
For each unit increment on the measure of overall happiness, a respondent was 1.76
times more likely to be very satisfied with his or her job. For each unit increment in job
prestige, a subject was 1.02 times as likely to be very satisfied with his or her job. For
each unit increment in job convenience (or hours worked), a subject was 1.02 times as
likely to be very satisfied with his or her job. Finally, for each increase in year, a
subject was 0.65 times as likely to be very satisfied with his or her job, i.e. was less
likely to be satisfied.
Important to the research question raised by the authors is the finding that
CHILDLT6 'Presence of Young Children' did not have a statistically significant impact on
job satisfaction.

Slide 42

## Stage 6: Validate The Model

In this stage, we are normally concerned with the following issues"
Creating the Selection Variable
Computing the Split-half Analysis
The Output for the Validation Analysis

## Conducting the Validation Analysis

To validate the logistic regression, we can randomly divide our sample into two groups, a
screening sample and a validation sample. The analysis is computed for the screening
sample and used to predict membership on the dependent variable in the validation
sample. If the model in the screening sample is valid, we would expect that the
accuracy rates for both samples to be about the same.
In the double cross- validation strategy, we reverse the designation of the screening and
validation sample and re-run the analysis. We can then compare the significant
independent variables found for both screening samples. If the two screening analyses
contain a very different set of significant variables, it indicates that the variables might
have achieved significance because of the sample size and not because of the strength
of the relationship. Our findings about these individual variables would that the
predictive utility of these variables is not generalizable.

Slide 43

## Young Children and Job Satisfaction

Slide 44

Compute the Variable to Randomly Split the Sample into Two Halves

Slide 45

Slide 46

## Specify the Value of the Selection Variable

for the First Validation Analysis
First, click on the 'Select>>" button to expose the 'Selection Variable:' text box.

Slide 47

## Specify the Value of the Selection Variable

for the Second Validation Analysis

Slide 48

Full Model

Split=0

Split=1

Model Chi-Square

57.153, p=.0000

54.386, p<.0001

28.867, p=.0109

Nagelkerke R2

.135

.246

.136

Learning ample

63.20%

72.12%

65.80%

56.51%

59.85%

## Accuracy Rate for

Validation Sample

Significant
Coefficients
(p < 0.05)

GENHAPPY 'How
Happy Generally'
PRESTIGE 'Job
Characteristic Prestige'
CONVENIE 'Job
Characteristic Convenience'
YEAR 'GSS Year
for Respondent'

GENHAPPY 'How
Happy Generally'
PRESTIGE 'Job
Characteristic Prestige'
CONVENIE 'Job
Characteristic Convenience'
FAMILSAT 'Family
Satisfaction'

CONVENIE 'Job
Characteristic Convenience'
YEAR 'GSS Year
for Respondent'
JCINCOME 'Job
Characteristic Income'

Only one predictor variable, CONVENIE 'Job Characteristic - Convenience, has a stable,
statistically significant relationship to the dependent variable, Job Satisfaction.
In addition, the accuracy that we should evaluate in assessing our model is in the 56% to
59% range rather than in the 63% to 72% range. At this accuracy rate, the model does
not represent a 25% increase over the proportional by chance accuracy rate.
In sum, we do find a relationship between one of the independent variables and job
satisfaction. Our findings should be regarded as tentative or exploratory rather than
definitive because we would not meet the classification accuracy rate required for a
usable model.
Tabachnick and Fidell Sample Problem

Slide 49