
In this exercise, we will work an example of logistic regression as found in the literature:

Sandra L. Hanson and Douglas M. Sloane, "Young Children and Job Satisfaction." Journal

of Marriage and the Family, 54 (November, 1992), 799-811.

The data for this problem are in the file YoungChildrenJobSatisfaction.Sav.

Slide 1

In this stage, the following issues are addressed:

Relationship to be analyzed

Specifying the dependent and independent variables

Method for including independent variables

Relationship to be analyzed

"We are interested in examining the effect of young children on the job satisfaction of

men and women involved in a variety of work and family roles to see how the presence

of family responsibilities affects their happiness at work. The research is comparative. It

involves contrasts between men and women in different work and marital statuses at

several points in time." (page 800)

Slide 2

The dependent variable is job satisfaction, measured on a four category Likert-scale:

1=Very Satisfied, 2=Moderately Satisfied, 3=A Little Dissatisfied, and 4=Very Dissatisfied.

Because the data does not follow a normal distribution (see pages 803-804), the authors

recoded the variable to a dichotomous variable where 1 = Very Satisfied and 0 =

Moderately Satisfied to Very Dissatisfied. The purpose of the analysis, then, is to

determine what factors contribute to a high level of job satisfaction versus some other

level of job satisfaction. With a dichotomous dependent variable, logistic regression

becomes the analytic technique of choice.
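The recode described above can be sketched in Python (a minimal illustration; the responses shown are hypothetical, and the actual recode would be done in SPSS):

```python
# Recode the 4-category Likert job satisfaction item into a
# dichotomous variable: 1 = Very Satisfied, 0 = all other levels.
# (The response values listed below are illustrative, not from the data set.)

def recode_satisfaction(score):
    """Return 1 for 'Very Satisfied' (score 1), else 0."""
    return 1 if score == 1 else 0

raw_scores = [1, 2, 3, 4, 1, 2]          # hypothetical responses
satisfied = [recode_satisfaction(s) for s in raw_scores]
print(satisfied)                          # [1, 0, 0, 0, 1, 0]
```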

The independent variables are grouped into two categories:

1. Individual and family characteristics (age, race, education, spouse's work status, prestige of spouse's occupation, number of children, presence of young children, general happiness, and satisfaction with family)

2. Job characteristics (income, job prestige, job authority, job autonomy, convenience (number of hours worked per week), and past work experience).

The variable presence of young children is important to answering the main question of

the article.

Other variables, which could have been included as independent variables, were used to

divide the sample into subgroups which were compared with each other to answer the

research questions. For example, Sex and Work Status were combined to form a

composite variable WORK_SEX. We will use these variables with the SPSS "Select Cases"

command to produce the results for different groups.

Slide 3

With a dichotomous dependent variable and a variety of independent variables, the

statistical technique to use is logistic regression. While we could structure the analysis

to do hierarchical entry of variables (individual, family characteristics, and job

characteristics in block 1 and the presence of young children in block 2), we will use

direct entry of all variables in a single step to conform to the authors' analysis.

Slide 4

In this stage, the following issues are addressed:

Missing data analysis

Minimum sample size requirement: 15-20 cases per independent variable

Slide 5

In the missing data analysis, we are looking for a pattern or process whereby the pattern

of missing data could influence the results of the statistical analysis.

The data set for this problem is used for a large number of analyses in the article. Not

all variables and cases are used in each analysis, so it makes sense to conduct the

missing data analysis on the cases and variables to be included in the problem in this

exercise.

We will compute the logistic regression model for 1976-77 married, full-time males as

presented in table 2 on page 807. (Note: this analysis does not include the independent

variables SPOCCUP 'Spouses Occupation' and EVWORK 'Ever Work as Long as One Year').

First, we will exclude the cases not used in this exercise and then we will examine

missing data for the variables used in this exercise.

Slide 6

Slide 7

Slide 8

Slide 9

Slide 10

Two independent variables have relatively large numbers of missing cases:

JCINCOME 'Job Characteristic - Income' and AUTHORIT 'Job Characteristic - Authority'.

However, all variables have valid data for 90% or more of cases, so no variables will be

excluded for an excessive number of missing cases.

Slide 11

Next, we examine the number of missing variables per case. Of the possible 14 variables

in the analysis (13 independent variables and 1 dependent variable), one case was

missing half of the variables (7) and should be excluded from the remaining analyses.

Slide 12

About 97.3% of the cases have no missing variables or one missing variable. Of those

cases missing two or more variables, the frequencies for the combinations are 4 or

fewer. There is no evidence of a predominant missing data pattern that will impact the

analysis.

Slide 13

The largest correlation in the matrix of valid/missing data (not shown) is 0.363. None of

the correlations for missing data values are above the weak level, so we can delete

missing cases without fear that we are distorting the solution.

Slide 14

15-20 cases per independent variable

If we accept the SPSS default of listwise deletion of missing data, we will have 538 cases

in the analysis. The ratio of cases to independent variables is 538/13 or 41 to 1. We

meet this requirement.
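The sample size check above is simple arithmetic (a sketch using the counts reported in the text):

```python
# Check the 15-20 cases per independent variable guideline
# using the figures reported in the text.
n_cases = 538
n_independent_vars = 13

ratio = n_cases / n_independent_vars
print(round(ratio, 1))        # 41.4, i.e. roughly 41 to 1
assert ratio >= 20            # comfortably exceeds the guideline
```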

Slide 15

In this stage, the following issues are addressed:

Incorporating nonmetric data with dummy variables

Representing Curvilinear Effects with Polynomials

Representing Interaction or Moderator Effects

All of the nonmetric variables have been recoded into dichotomous dummy-coded variables.

We do not have any evidence of curvilinear effects at this point in the analysis.

We do not have any evidence at this point in the analysis that we should add interaction

or moderator variables.
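Dummy coding as described above can be sketched in Python (the category labels and helper function are hypothetical; in SPSS this would be done with a recode):

```python
# Dummy-coding a nonmetric (categorical) variable as 0/1 indicators.
# Category labels here are illustrative, not taken from the data set.

def dummy_code(values, reference):
    """Code each value 0 if it equals the reference category, else 1."""
    return [0 if v == reference else 1 for v in values]

race = ["white", "nonwhite", "white", "nonwhite"]
race_dummy = dummy_code(race, reference="white")
print(race_dummy)   # [0, 1, 0, 1]
```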

Young Children and Job Satisfaction

Slide 16

In this stage, the following issues are addressed:

Nonmetric dependent variable with two groups

Metric or dummy-coded independent variables

The dependent variable 'Job satisfaction' was recoded into dichotomous categories.

Marital status, race, spouse's work status, presence of young children, job authority, job

autonomy, and ever worked as long as one year are all coded as dichotomous variables.

Age of respondent, highest year of school completed, prestige of spouse's occupation,

number of children, general happiness, satisfaction with family, income, job prestige,

hours worked (convenience), and year of the survey can be treated as metric variables.

Slide 17

Assessing Overall Fit: Model Estimation

In this stage, the following issues are addressed:

Compute logistic regression model

The steps to obtain a logistic regression analysis are detailed on the following screens.

If the cases to be included in this analysis were not selected in the missing data analysis,

the selection needs to be completed before proceeding.

Slide 18

Slide 19

Slide 20

Slide 21

Slide 22

Slide 23

Slide 24

Slide 25

Assessing Overall Fit: Assessing Model Fit

In this stage, the following issues are addressed:

Significance test of the model log likelihood (Change in -2LL)

Measures Analogous to R²: Cox and Snell R² and Nagelkerke R²

Hosmer-Lemeshow Goodness-of-fit

Classification matrices

Check for Numerical Problems

Presence of outliers

Slide 26

The Initial Log Likelihood Function (-2 Log Likelihood, or -2LL) is a statistical measure like the total sum of squares in regression. If our independent variables have a relationship

to the dependent variable, we will improve our ability to predict the dependent variable

accurately, and the log likelihood value will decrease. The initial -2LL value is 742.850

on step 0, before any variables have been added to the model.

Slide 27

The difference between these two measures is the model chi-square value (57.153 = 742.850 − 685.697) that is tested for statistical significance. This test is analogous to the F-test for R² or change in R² value in multiple regression, which tests whether or not the

improvement in the model associated with the additional variables is statistically

significant.

In this problem the model Chi-Square value of 57.153 has a significance of 0.000, less

than 0.05, so we conclude that there is a significant relationship between the dependent

variable and the set of independent variables.
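The model chi-square above is simply the drop in -2LL, which can be checked directly (a sketch using the values from the output):

```python
# Model chi-square as the drop in -2 log likelihood (-2LL)
# from the null model (step 0) to the fitted model,
# using the values reported in the text.
initial_2ll = 742.850   # -2LL with no predictors
model_2ll = 685.697     # -2LL with all 13 predictors entered

model_chi_square = initial_2ll - model_2ll
print(round(model_chi_square, 3))   # 57.153, tested on 13 degrees of freedom
```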

Slide 28

Measures Analogous to R²

The next SPSS outputs indicate the strength of the relationship between the dependent

variable and the independent variables, analogous to the R² measures in multiple

regression.

The Cox and Snell R² measure operates like R², with higher values indicating greater model fit. However, this measure is limited in that it cannot reach the maximum value of 1, so Nagelkerke proposed a modification that ranges from 0 to 1. We will rely

upon Nagelkerke's measure as indicating the strength of the relationship.

Based on the interpretive criteria, we would characterize this model as weak.
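The link between the -2LL values and these pseudo-R² measures can be verified with the standard Cox and Snell and Nagelkerke formulas (a sketch; SPSS reports these values directly):

```python
import math

# Recover the pseudo-R-squared measures from the -2LL values reported earlier.
n = 538                  # cases in the analysis
null_2ll = 742.850       # -2LL at step 0
model_chi_square = 57.153

cox_snell = 1 - math.exp(-model_chi_square / n)
max_cox_snell = 1 - math.exp(-null_2ll / n)   # upper bound, less than 1
nagelkerke = cox_snell / max_cox_snell        # rescaled to the 0-1 range

print(round(cox_snell, 3))    # 0.101
print(round(nagelkerke, 3))   # 0.135, the value SPSS reports for this model
```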

Slide 29

Correspondence of Actual and Predicted Values of the Dependent Variable

The final measure of model fit is the Hosmer and Lemeshow goodness-of-fit statistic,

which measures the correspondence between the actual and predicted values of the

dependent variable. In this case, better model fit is indicated by a smaller difference in

the observed and predicted classification. A good model fit is indicated by a

nonsignificant chi-square value.

The goodness-of-fit measure has a value of 5.678 which has the desirable outcome of

nonsignificance.


Slide 30

The classification matrices in logistic regression serve the same function as the classification matrices in discriminant analysis, i.e. evaluating the accuracy of the model.

To evaluate the accuracy of the model, we compute the proportional by chance accuracy

rate and the maximum by chance accuracy rate, if appropriate. Since the sizes of the groups in this problem are 46.3% and 53.7%, the proportional by chance accuracy criterion is appropriate because we do not have a dominant group.

The proportional by chance accuracy rate is equal to 0.503 (0.463^2 + 0.537^2). A 25%

increase over the by chance accuracy rate would equal 0.628.

Our model accuracy rate of 63.2% meets this criterion.
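The by chance accuracy computation above can be verified directly (group proportions from the text):

```python
# Proportional by chance accuracy rate for two groups comprising
# 46.3% and 53.7% of the sample, plus the 25%-improvement criterion.
p_satisfied = 0.463
p_other = 0.537

by_chance = p_satisfied ** 2 + p_other ** 2
criterion = 1.25 * by_chance      # 25% improvement over chance

print(round(by_chance, 3))   # 0.503
print(round(criterion, 3))   # 0.628
assert 0.632 > criterion     # the model's 63.2% accuracy meets the criterion
```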

Slide 31

Stacked Histogram

SPSS provides a visual image of the classification accuracy in the stacked histogram as shown below. To the extent to which the cases in one group cluster on the left and the other group clusters on the right, the predictive accuracy of the model will be higher.

Slide 32

There are several numerical problems that can occur in logistic regression that are not

detected by SPSS or other statistical packages: multicollinearity among the independent

variables, zero cells for a dummy-coded independent variable because all of the

subjects have the same value for the variable, and "complete separation" whereby the

two groups in the dependent event variable can be perfectly separated by scores on one

of the independent variables.

All of these problems produce large standard errors (over 2) for the variables included in

the analysis and very often produce very large B coefficients as well. If we encounter

large standard errors for the predictor variables, we should examine frequency tables,

one-way ANOVAs, and correlations for the variables involved to try to identify the source

of the problem.

The standard errors and B coefficients are not excessively large, so there is no evidence of a numeric problem with this analysis.

Slide 33

Presence of outliers

There are two outputs to alert us to outliers that we might consider excluding from the

analysis: listing of residuals and saving Cook's distance scores to the data set.

SPSS provides a casewise list of residuals that identify cases whose residual is above or

below a certain number of standard deviation units. As in multiple regression, there are a

variety of ways to compute the residual. In logistic regression, the residual is the

difference between the observed probability of the dependent variable event and the

predicted probability based on the model. The standardized residual is the residual

divided by an estimate of its standard deviation. The deviance is calculated by taking

the square root of -2 x the log of the predicted probability for the observed group and

attaching a negative sign if the event did not occur for that case. Large values for

deviance indicate that the model does not fit the case well. The studentized residual

for a case is the change in the model deviance if the case is excluded. Discrepancies

between the deviance and the studentized residual may identify unusual cases. (See the

SPSS chapter on Logistic Regression Analysis for additional details).
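The deviance residual definition above can be sketched as follows (the predicted probabilities are hypothetical):

```python
import math

# Deviance residual as defined above: the square root of -2 times the
# log of the predicted probability for the case's observed group,
# with a negative sign attached if the event did not occur.
# (The probabilities used below are hypothetical.)

def deviance_residual(event_occurred, predicted_event_prob):
    p_observed = predicted_event_prob if event_occurred else 1 - predicted_event_prob
    dev = math.sqrt(-2 * math.log(p_observed))
    return dev if event_occurred else -dev

print(round(deviance_residual(True, 0.8), 3))    # 0.668: model fits this case well
print(round(deviance_residual(False, 0.8), 3))   # -1.794: poorer fit
```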

In the output for our problem, SPSS listed one case that may be considered an outlier, with a studentized residual greater than 2:

Slide 34

Cook's Distance

SPSS has an option to compute Cook's distance as a measure of influential cases and add

the score to the data editor. I am not aware of a precise formula for determining what

cutoff value should be used, so we will rely on the more traditional method for

interpreting Cook's distance which is to identify cases that either have a score of 1.0 or

higher, or cases which have a Cook's distance substantially different from the others. The

prescribed method for detecting unusually large Cook's distance scores is to create a

scatterplot of Cook's distance scores versus case id.
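The flagging rule described above can be sketched as follows (the Cook's distance values are hypothetical; only the case ids come from the output discussed later):

```python
# Flag influential cases by Cook's distance: a score of 1.0 or higher,
# or a score far above the rest of the sample (here, above the 0.175
# gridline used in the scatterplot). The distance values are hypothetical.
cooks_d = {99: 0.21, 150: 0.02, 1807: 0.19, 1810: 0.05,
           1833: 0.18, 1900: 0.01, 1953: 0.20}

flagged = sorted(case_id for case_id, d in cooks_d.items()
                 if d >= 1.0 or d > 0.175)
print(flagged)   # [99, 1807, 1833, 1953]
```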

Slide 35

Slide 36

Slide 37

Horizontal gridlines were added to the scatterplot to aid interpretation. Based on the

gridlines, we can identify four cases with Cook's distances above 0.175 as influential

cases.

After sorting the data set by the Cook's distance variable, we identify the four cases as having id numbers: 99, 1807, 1833, and 1953. None of these cases were included on the casewise listing for large studentized residuals.

Based on these outputs, we identify five cases out of 538 that are potential outliers. Since the number of outliers represents less than 1% of the sample and none of the outliers are really extreme, I will opt to retain them in the analysis.

Slide 38

In this section, we address the following issues:

Identifying the statistically significant predictor variables

Direction of relationship and contribution to dependent variable

Slide 39

The table of variables in the equation identifies for us the predictor variables that have

a statistically significant individual relationship to the dependent variable. Scanning the

'Sig' column, we identify four variables that have a significance level less than

0.05: GENHAPPY 'How Happy Generally', PRESTIGE 'Job Characteristic - Prestige',

CONVENIE 'Job Characteristic - Convenience', and YEAR 'GSS Year for Respondent'.

Slide 40

The sign of the B coefficients indicates whether the predictor variable increased or

decreased the likelihood of belonging to the group of respondents who were very

satisfied with their jobs.

Slide 41

The coefficient signs for the variables GENHAPPY 'How Happy Generally', PRESTIGE 'Job

Characteristic - Prestige', and CONVENIE 'Job Characteristic - Convenience' were all

positive, indicating that a higher score on these variables enhanced the likelihood of

belonging to the group that was very satisfied with their jobs. The coefficient for YEAR

was negative, indicating that job satisfaction has been declining in later years of the

survey.

The magnitude of change associated with each independent variable is given in the odds

ratio column labeled 'Exp (B)'. This column indicates the increased or decreased odds of

belonging to the group that was very satisfied with their jobs.

For each unit increment on the measure of overall happiness, a respondent was 1.76

times more likely to be very satisfied with his or her job. For each unit increment in job

prestige, a subject was 1.02 times as likely to be very satisfied with his or her job. For

each unit increment in job convenience (or hours worked), a subject was 1.02 times as

likely to be very satisfied with his or her job. Finally, for each increase in year, a

subject was 0.65 times as likely to be very satisfied with his or her job, i.e. was less

likely to be satisfied.
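Since Exp(B) is e raised to the coefficient B, the signs and magnitudes above can be cross-checked from the odds ratios (a sketch using the rounded figures from the text):

```python
import math

# Odds ratios Exp(B) are the exponentiated logistic coefficients,
# so B = ln(Exp(B)). Recover approximate B values from the rounded
# odds ratios reported in the text.
odds_ratios = {"GENHAPPY": 1.76, "PRESTIGE": 1.02, "CONVENIE": 1.02, "YEAR": 0.65}
b = {name: round(math.log(orr), 3) for name, orr in odds_ratios.items()}

# Note YEAR's odds ratio below 1 corresponds to a negative coefficient.
print(b)   # {'GENHAPPY': 0.565, 'PRESTIGE': 0.02, 'CONVENIE': 0.02, 'YEAR': -0.431}
```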

Important to the research question raised by the authors is the finding that

CHILDLT6 'Presence of Young Children' did not have a statistically significant impact on

job satisfaction.

Slide 42

In this stage, we are normally concerned with the following issues:

Creating the Selection Variable

Computing the Split-half Analysis

The Output for the Validation Analysis

To validate the logistic regression, we can randomly divide our sample into two groups, a

screening sample and a validation sample. The analysis is computed for the screening

sample and used to predict membership on the dependent variable in the validation

sample. If the model in the screening sample is valid, we would expect the accuracy rates for both samples to be about the same.

In the double cross-validation strategy, we reverse the designation of the screening and

validation sample and re-run the analysis. We can then compare the significant

independent variables found for both screening samples. If the two screening analyses

contain a very different set of significant variables, it indicates that the variables might

have achieved significance because of the sample size and not because of the strength

of the relationship. Our findings about these individual variables would be that the

predictive utility of these variables is not generalizable.
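The random split described above can be sketched in Python (the seed and 0/1 assignment are illustrative; in SPSS this is done with a computed selection variable):

```python
import random

# Sketch of the split-half validation design described above:
# randomly assign each case a split value of 0 or 1, estimate the model
# on one half (screening sample), classify the other half (validation
# sample), then reverse the roles for the double cross-validation.
rng = random.Random(20020101)        # fixed seed for reproducibility
n_cases = 538

split = [rng.randint(0, 1) for _ in range(n_cases)]
screening = [i for i, s in enumerate(split) if s == 0]
validation = [i for i, s in enumerate(split) if s == 1]

assert len(screening) + len(validation) == n_cases
# Analysis 1: estimate on `screening`, predict `validation`.
# Analysis 2: reverse the roles and compare the significant predictors.
```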

Slide 43

Slide 44

Compute the Variable to Randomly Split the Sample into Two Halves

Slide 45

Slide 46

for the First Validation Analysis

First, click on the 'Select>>' button to expose the 'Selection Variable:' text box.

Slide 47

for the Second Validation Analysis

Slide 48

Full Model
Model Chi-Square: 57.153, p=.0000
Nagelkerke R²: .135
Learning Sample accuracy: 63.20%
Significant coefficients (p < 0.05): GENHAPPY 'How Happy Generally', PRESTIGE 'Job Characteristic - Prestige', CONVENIE 'Job Characteristic - Convenience', YEAR 'GSS Year for Respondent'

Split=0
Model Chi-Square: 54.386, p<.0001
Nagelkerke R²: .246
Learning Sample accuracy: 72.12%
Validation Sample accuracy: 56.51%
Significant coefficients (p < 0.05): GENHAPPY 'How Happy Generally', PRESTIGE 'Job Characteristic - Prestige', CONVENIE 'Job Characteristic - Convenience', FAMILSAT 'Family Satisfaction'

Split=1
Model Chi-Square: 28.867, p=.0109
Nagelkerke R²: .136
Learning Sample accuracy: 65.80%
Validation Sample accuracy: 59.85%
Significant coefficients (p < 0.05): CONVENIE 'Job Characteristic - Convenience', YEAR 'GSS Year for Respondent', JCINCOME 'Job Characteristic - Income'

Only one predictor variable, CONVENIE 'Job Characteristic - Convenience', has a stable,

statistically significant relationship to the dependent variable, Job Satisfaction.

In addition, the accuracy that we should evaluate in assessing our model is in the 56% to

59% range rather than in the 63% to 72% range. At this accuracy rate, the model does

not represent a 25% increase over the proportional by chance accuracy rate.

In sum, we do find a relationship between one of the independent variables and job

satisfaction. Our findings should be regarded as tentative or exploratory rather than

definitive because we would not meet the classification accuracy rate required for a

usable model.

Tabachnick and Fidell Sample Problem

Slide 49
