# Introduction Probability, Odds, and Odds Ratios Logistic Regression ROC Curves

SAS Logistic Regression
Jason Brinkley - Department of Biostatistics

February 2, 2009

In traditional multiple regression, the intent is to study what effect different covariates have on a quantitative response. In many scenarios, however, the main question of interest involves a dichotomous response: Yes/No, Success/Fail, Sick/Well, etc. Traditional methods fail to adequately model this kind of data. Let's look at an example.

Accident Data

(From Cody) Let's say we want to see if age, vision status, driver education, and gender can be used to predict whether a person had an accident in the past year. Consider a sample of such individuals (see the website for related files). We can import the data from Excel using Proc Import.

SAS> PROC IMPORT OUT= WORK.ACCIDENT
SAS> DATAFILE= "U:\SAS Workshop\Lecture 3 - Linear and Logistic Regression\Cody..."
SAS> DBMS=EXCEL REPLACE;
SAS> RANGE="accident$";
SAS> GETNAMES=YES;
SAS> MIXED=NO;
SAS> SCANTEXT=YES;
SAS> USEDATE=YES;
SAS> SCANTIME=YES;
SAS> RUN;
SAS>
SAS> PROC FORMAT;
SAS> VALUE VISION
SAS> 0 = 'No Problem'
SAS> 1 = 'Some Problem';
SAS> VALUE YES_NO
SAS> 0 = 'No'
SAS> 1 = 'Yes';
SAS> RUN;

SAS> DATA LOGISTIC;
SAS> SET ACCIDENT;
SAS>
SAS> LABEL
SAS> ACCIDENT = 'Accident in Last Year?'
SAS> AGE = 'Age of Driver'
SAS> VISION = 'Vision Problem?'
SAS> DRIVER_ED = 'Driver Education?';
SAS>
SAS> FORMAT ACCIDENT DRIVER_ED YES_NO.
SAS> VISION VISION.;
SAS> RUN;

SAS> PROC PRINT DATA=LOGISTIC(OBS=5);
SAS> RUN;

| Obs | Accident | Age | Vision | Driver_Ed | Gender |
|---|---|---|---|---|---|
| 1 | Yes | 16 | Some Problem | No | M |
| 2 | Yes | 17 | Some Problem | Yes | M |
| 3 | Yes | 17 | No Problem | No | M |
| 4 | No | 17 | No Problem | No | M |
| 5 | Yes | 18 | Some Problem | No | M |

Graph Age Versus Accident

SAS> PROC GPLOT;
SAS> PLOT ACCIDENT*AGE;
SAS> RUN;

Just by examining the graph we can see why traditional models will fail here. What we are really interested in is whether age, vision status, driver education, and gender have an impact on the PROBABILITY of an accident. Since what we are really interested in is modeling probabilities, different techniques need to be used.

Probability and Odds

Logistic regression is a good way to model this type of data. In order to understand this type of regression we should first talk about odds and odds ratios. Let's say P is the probability of an accident; then the odds of having an accident are given by

$$\text{Odds} = \frac{P}{1-P}$$

Sometimes we will find it important to go from odds back to probabilities, so also note

$$P = \frac{\text{Odds}}{1+\text{Odds}}$$
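As a minimal sketch of the two conversions, the DATA step below turns a probability into odds and back again. The data set name ODDS_DEMO, the variable names, and the value 0.30 are illustrative assumptions, not part of the accident data.

```sas
* Hypothetical illustration: convert a probability to odds and back;
DATA ODDS_DEMO;
   P = 0.30;                      /* assumed probability of an accident */
   ODDS = P / (1 - P);            /* odds = P / (1 - P) */
   P_BACK = ODDS / (1 + ODDS);    /* recover the probability from the odds */
RUN;

PROC PRINT DATA=ODDS_DEMO;
RUN;
```

With P = 0.30 the odds come out to roughly 0.43, and converting back returns 0.30.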

Odds Ratios

To compare the odds of having an accident between different groups (e.g. different genders or different vision problems) we often use odds ratios. Say the chance of having a wreck among people with poor vision is 30% and among people with good vision it's 10%. The corresponding odds are 0.43 and 0.11, and the ratio of the odds is about 3.86 (or 3.91 if computed from the rounded odds). So the odds of an accident have almost quadrupled for people with vision problems. Odds and odds ratios are easier to work with mathematically than pure probabilities, especially in scenarios where you have a rare event.
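The arithmetic behind this example can be written out as follows. It uses only the 30% and 10% figures from the text; the subscripts "poor" and "good" are labels introduced here for the two vision groups.

```latex
\[
\text{Odds}_{\text{poor}} = \frac{0.30}{1 - 0.30} \approx 0.43,
\qquad
\text{Odds}_{\text{good}} = \frac{0.10}{1 - 0.10} \approx 0.11
\]
\[
\text{OR} = \frac{0.30 / 0.70}{0.10 / 0.90} = \frac{27}{7} \approx 3.86
\]
```

Dividing the rounded odds instead (0.43 / 0.11) gives about 3.91, which is why the two figures differ slightly.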

Logistic Regression

Logistic regression fits a model like

$$\log(\text{Odds}) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots$$

where $X_1, X_2, \ldots$ are the covariates of interest. We do logistic regression modeling in SAS with Proc Logistic, and it will have many parallels with both Proc Reg and Proc GLM in terms of code format. Since we have 1 response and 4 potential predictors, let's do a logistic regression with all of our data.
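Because the model is linear on the log-odds scale, a fitted value has to be transformed to get back to a probability. A short derivation, writing $\eta$ (a symbol introduced here) for the linear predictor and using the odds-to-probability relation from earlier:

```latex
\[
\eta = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots = \log(\text{Odds})
\;\Longrightarrow\;
\text{Odds} = e^{\eta}
\]
\[
P = \frac{\text{Odds}}{1 + \text{Odds}} = \frac{e^{\eta}}{1 + e^{\eta}} = \frac{1}{1 + e^{-\eta}}
\]
```

This also shows why each coefficient is read as a log odds ratio: raising $X_j$ by one unit multiplies the odds by $e^{\beta_j}$.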

Example

SAS> PROC LOGISTIC DATA=LOGISTIC DESCENDING;
SAS> *Always use the DESCENDING option;
SAS> TITLE "Predicting Accidents Using Logistic Regression";
SAS> CLASS GENDER;
SAS> MODEL ACCIDENT = AGE VISION DRIVER_ED GENDER;
SAS> RUN;
SAS> QUIT;

Predicting Accidents Using Logistic Regression

The LOGISTIC Procedure

Model Information

| Item | Value |
|---|---|
| Data Set | WORK.LOGISTIC |
| Response Variable | Accident (Accident in Last Year?) |
| Number of Response Levels | 2 |
| Model | binary logit |
| Optimization Technique | Fisher's scoring |
| Number of Observations Read | 45 |
| Number of Observations Used | 45 |

Response Profile

| Ordered Value | Accident | Total Frequency |
|---|---|---|
| 1 | Yes | 25 |
| 2 | No | 20 |

Probability modeled is Accident='Yes'.

Class Level Information

| Class | Value | Design Variables |
|---|---|---|
| Gender | F | 1 |
| | M | -1 |

Model Convergence Status

Convergence criterion (GCONV=1E-8) satisfied.

Model Fit Statistics

| Criterion | Intercept Only | Intercept and Covariates |
|---|---|---|
| AIC | 63.827 | 59.094 |
| SC | 65.633 | 68.128 |
| -2 Log L | 61.827 | 49.094 |

Testing Global Null Hypothesis: BETA=0

| Test | Chi-Square | DF | Pr > ChiSq |
|---|---|---|---|
| Likelihood Ratio | 12.7321 | 4 | 0.0127 |
| Score | 11.4432 | 4 | 0.0220 |
| Wald | 9.0132 | 4 | 0.0608 |

Type 3 Analysis of Effects

| Effect | DF | Wald Chi-Square | Pr > ChiSq |
|---|---|---|---|
| Age | 1 | 0.0170 | 0.8962 |
| Vision | 1 | 5.4891 | 0.0191 |
| Driver_Ed | 1 | 5.0776 | 0.0242 |
| Gender | 1 | 1.0270 | 0.3109 |

Analysis of Maximum Likelihood Estimates

| Parameter | DF | Estimate | Standard Error | Wald Chi-Square | Pr > ChiSq |
|---|---|---|---|---|---|
| Intercept | 1 | 0.1373 | 1.0548 | 0.0169 | 0.8965 |
| Age | 1 | 0.00247 | 0.0190 | 0.0170 | 0.8962 |
| Vision | 1 | 1.6758 | 0.7153 | 5.4891 | 0.0191 |
| Driver_Ed | 1 | -1.7160 | 0.7615 | 5.0776 | 0.0242 |
| Gender F | 1 | 0.3847 | 0.3796 | 1.0270 | 0.3109 |

Odds Ratio Estimates

| Effect | Point Estimate | 95% Wald Lower | 95% Wald Upper |
|---|---|---|---|
| Age | 1.002 | 0.966 | 1.040 |
| Vision | 5.343 | 1.315 | 21.709 |
| Driver_Ed | 0.180 | 0.040 | 0.800 |
| Gender F vs M | 2.158 | 0.487 | 9.559 |

Association of Predicted Probabilities and Observed Responses

| Statistic | Value | Statistic | Value |
|---|---|---|---|
| Percent Concordant | 77.6 | Somers' D | 0.566 |
| Percent Discordant | 21.0 | Gamma | 0.574 |
| Percent Tied | 1.4 | Tau-a | 0.286 |
| Pairs | 500 | c | 0.783 |
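The odds ratio estimates in this output are the exponentiated coefficients. A quick check, assuming the Vision, Driver_Ed, and Gender estimates reported in the output are 1.6758, -1.7160, and 0.3847. For Gender, the default effect coding (F = 1, M = -1) puts the two levels 2 units apart, so the coefficient is doubled before exponentiating.

```latex
\[
\text{OR}_{\text{Vision}} = e^{1.6758} \approx 5.343,
\qquad
\text{OR}_{\text{Driver\_Ed}} = e^{-1.7160} \approx 0.180
\]
\[
\text{OR}_{\text{Gender, F vs M}} = e^{2 \times 0.3847} = e^{0.7694} \approx 2.158
\]
```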

Model Selection

SAS> PROC LOGISTIC DATA=LOGISTIC DESCENDING;
SAS> TITLE "Predicting Accidents Using Logistic Regression";
SAS> CLASS GENDER;
SAS> MODEL ACCIDENT = AGE VISION DRIVER_ED GENDER/
SAS> SELECTION = BACKWARD;
SAS> RUN;
SAS> QUIT;

Predicting Accidents Using Logistic Regression

The LOGISTIC Procedure

Summary of Backward Elimination

| Step | Effect Removed | DF | Number In | Wald Chi-Square | Pr > ChiSq | Variable Label |
|---|---|---|---|---|---|---|
| 1 | Age | 1 | 3 | 0.0170 | 0.8962 | Age of Driver |
| 2 | Gender | 1 | 2 | 1.1313 | 0.2875 | Gender |

Type 3 Analysis of Effects

| Effect | DF | Wald Chi-Square | Pr > ChiSq |
|---|---|---|---|
| Vision | 1 | 5.9113 | 0.0150 |
| Driver_Ed | 1 | 4.5440 | 0.0330 |

Analysis of Maximum Likelihood Estimates

| Parameter | DF | Estimate | Standard Error | Wald Chi-Square | Pr > ChiSq |
|---|---|---|---|---|---|
| Intercept | 1 | 0.1110 | 0.5457 | 0.0414 | 0.8389 |
| Vision | 1 | 1.7137 | 0.7049 | 5.9113 | 0.0150 |
| Driver_Ed | 1 | -1.5000 | 0.7037 | 4.5440 | 0.0330 |

Odds Ratio Estimates

| Effect | Point Estimate | 95% Wald Lower | 95% Wald Upper |
|---|---|---|---|
| Vision | 5.550 | 1.394 | 22.093 |
| Driver_Ed | 0.223 | 0.056 | 0.886 |

Association of Predicted Probabilities and Observed Responses

| Statistic | Value | Statistic | Value |
|---|---|---|---|
| Percent Concordant | 67.2 | Somers' D | 0.532 |
| Percent Discordant | 14.0 | Gamma | 0.655 |
| Percent Tied | 18.8 | Tau-a | 0.269 |
| Pairs | 500 | c | 0.766 |

Age is not significant?

Our regression models seem to indicate that age is not a significant covariate, which seems counterintuitive. Let's explore the data.

SAS> OPTIONS PS=24;
SAS> PATTERN COLOR=BLACK VALUE=EMPTY;
SAS> PROC GCHART DATA=LOGISTIC;
SAS> TITLE "Distribution of Ages by Accident Status";
SAS> VBAR AGE / MIDPOINTS=10 TO 90 BY 10 GROUP=ACCIDENT;
SAS> RUN;

[Figure: bar chart of FREQUENCY (0 to 8) by Age of Driver (midpoints 10 to 90), shown separately for Accident in Last Year? = No and Yes.]

Spike in Young/Old People

There seems to be a spike in accidents in the young and old groups. Let's focus on those people by making a new age group variable that indicates whether a person is between 20 and 65 or not.

SAS> *CREATE AGE GROUPS;
SAS> DATA LOGISTIC;
SAS> SET LOGISTIC;
SAS> IF AGE GE 20 AND AGE LE 65 THEN AGEGROUP = 0;
SAS> ELSE AGEGROUP = 1;
SAS> RUN;
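One way to confirm that the new indicator behaves as intended is to cross-tabulate it against the response. This is a sketch added for illustration, not part of the original workshop code; it assumes the LOGISTIC data set created above.

```sas
* Sketch: cross-tabulate the new AGEGROUP indicator against ACCIDENT;
PROC FREQ DATA=LOGISTIC;
   TABLES AGEGROUP*ACCIDENT;
RUN;
```

Note that with the IF/ELSE logic used to build AGEGROUP, any observation with a missing AGE would fall into the ELSE branch and be coded 1, since a missing numeric value compares as smaller than 20 in SAS.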

Model Selection

SAS> PROC LOGISTIC DATA=LOGISTIC DESCENDING;
SAS> TITLE "Predicting Accidents Using Logistic Regression";
SAS> *THIS NEXT LINE CHANGES THE REFERENCE GROUP;
SAS> CLASS GENDER (PARAM=REF REF='F');
SAS> MODEL ACCIDENT = AGEGROUP VISION DRIVER_ED GENDER /
SAS> SELECTION=BACKWARD;
SAS> RUN;
SAS> QUIT;

NOTE: No (additional) effects met the 0.05 significance level for removal from the model.

Summary of Backward Elimination

| Step | Effect Removed | DF | Number In | Wald Chi-Square | Pr > ChiSq | Variable Label |
|---|---|---|---|---|---|---|
| 1 | Gender | 1 | 3 | 0.8548 | 0.3552 | Gender |
| 2 | Driver_Ed | 1 | 2 | 2.0307 | 0.1541 | Driver Education? |

Predicting Accidents Using Logistic Regression

The LOGISTIC Procedure

Type 3 Analysis of Effects

| Effect | DF | Wald Chi-Square | Pr > ChiSq |
|---|---|---|---|
| AGEGROUP | 1 | 7.2711 | 0.0070 |
| Vision | 1 | 4.9265 | 0.0264 |

Analysis of Maximum Likelihood Estimates

| Parameter | DF | Estimate | Standard Error | Wald Chi-Square | Pr > ChiSq |
|---|---|---|---|---|---|
| Intercept | 1 | -1.3334 | 0.5854 | 5.1886 | 0.0227 |
| AGEGROUP | 1 | 2.1611 | 0.8014 | 7.2711 | 0.0070 |
| Vision | 1 | 1.6258 | 0.7325 | 4.9265 | 0.0264 |

Odds Ratio Estimates

| Effect | Point Estimate | 95% Wald Lower | 95% Wald Upper |
|---|---|---|---|
| AGEGROUP | 8.680 | 1.805 | 41.756 |
| Vision | 5.083 | 1.209 | 21.359 |

Final Model

SAS> ods graphics on;
SAS> PROC LOGISTIC DATA=LOGISTIC DESCENDING;
SAS> TITLE "Predicting Accidents Using Logistic Regression";
SAS> MODEL ACCIDENT = VISION AGEGROUP /
SAS> CTABLE PPROB=(0 to 1 by .10)
SAS> OUTROC=ROC;
SAS> OUTPUT OUT=PREDICTED P=PHAT LOWER=LCL UPPER=UCL;
SAS> RUN;
SAS> QUIT;
SAS> ods graphics off;

Predicted Probabilities

From our final model we can output a new dataset that will have our original data and our predicted probabilities.

SAS> PROC PRINT DATA=PREDICTED(OBS=5);
SAS> TITLE 'Predicted Probabilities and 95% Confidence Limits';
SAS> RUN;

| Obs | Accident | Age | Vision | Driver_Ed | Gender | AGEGROUP | _LEVEL_ | PHAT | LCL | UCL |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Yes | 16 | Some Problem | No | M | 1 | Yes | 0.92082 | 0.70197 | 0.98288 |
| 2 | Yes | 17 | Some Problem | Yes | M | 1 | Yes | 0.92082 | 0.70197 | 0.98288 |
| 3 | Yes | 17 | No Problem | No | M | 1 | Yes | 0.69586 | 0.36264 | 0.90197 |
| 4 | No | 17 | No Problem | No | M | 1 | Yes | 0.69586 | 0.36264 | 0.90197 |
| 5 | Yes | 18 | Some Problem | No | M | 1 | Yes | 0.92082 | 0.70197 | 0.98288 |
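The PHAT values can be reproduced by hand from the final model's coefficients. The check below assumes the estimates from the backward-elimination output are Intercept = -1.3334, AGEGROUP = 2.1611, and Vision = 1.6258, with Vision coded 1 for 'Some Problem' and 0 for 'No Problem'.

```latex
\[
\text{AGEGROUP}=1,\ \text{Vision}=1:\quad
\log(\text{Odds}) = -1.3334 + 2.1611 + 1.6258 = 2.4535,
\quad
P = \frac{1}{1 + e^{-2.4535}} \approx 0.9208
\]
\[
\text{AGEGROUP}=1,\ \text{Vision}=0:\quad
\log(\text{Odds}) = -1.3334 + 2.1611 = 0.8277,
\quad
P = \frac{1}{1 + e^{-0.8277}} \approx 0.6959
\]
```

Both values agree with the PHAT column for the first five observations.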

Assessing your model's fit and prediction ability

While there does exist a generalized R-square measure for these types of models, it is not used in general practice. One alternative for assessing model fit is to look at the percentages of concordant and discordant pairs between predicted probabilities and observed responses. Another, more popular option is the receiver operating characteristic curve (ROC curve).

Classification Table

Note that in our last block of code we requested a classification table. This table shows us how well our model performs under various predicted probability cutoffs. You may assume from our model that we would say anyone with a predicted probability of 0.50 or greater is likely to be in an accident. Choosing 0.50 is somewhat arbitrary, and perhaps we want to look at cases where the cutoff is larger or smaller than 0.50. By lowering that cutoff we should be able to predict more accidents, but we will increase our false positives.

Predicting Accidents Using Logistic Regression

The LOGISTIC Procedure

Classification Table

| Prob Level | Correct Event | Correct NonEvent | Incorrect Event | Incorrect NonEvent | Correct (%) | Sensitivity (%) | Specificity (%) | False POS (%) | False NEG (%) |
|---|---|---|---|---|---|---|---|---|---|
| 0.000 | 25 | 0 | 20 | 0 | 55.6 | 100.0 | 0.0 | 44.4 | . |
| 0.100 | 25 | 0 | 20 | 0 | 55.6 | 100.0 | 0.0 | 44.4 | . |
| 0.200 | 21 | 0 | 20 | 4 | 46.7 | 84.0 | 0.0 | 48.8 | 100.0 |
| 0.300 | 21 | 11 | 9 | 4 | 71.1 | 84.0 | 55.0 | 30.0 | 26.7 |
| 0.400 | 21 | 11 | 9 | 4 | 71.1 | 84.0 | 55.0 | 30.0 | 26.7 |
| 0.500 | 21 | 11 | 9 | 4 | 71.1 | 84.0 | 55.0 | 30.0 | 26.7 |
| 0.600 | 15 | 11 | 9 | 10 | 57.8 | 60.0 | 55.0 | 37.5 | 47.6 |
| 0.700 | 11 | 17 | 3 | 14 | 62.2 | 44.0 | 85.0 | 21.4 | 45.2 |
| 0.800 | 11 | 20 | 0 | 14 | 68.9 | 44.0 | 100.0 | 0.0 | 41.2 |
| 0.900 | 11 | 20 | 0 | 14 | 68.9 | 44.0 | 100.0 | 0.0 | 41.2 |
| 1.000 | 0 | 20 | 0 | 25 | 44.4 | 0.0 | 100.0 | . | 55.6 |

Sensitivity and Specificity

In this table, correct measures the total percentage predicted correctly, sensitivity is the percentage of events (accidents) that were successfully predicted, and specificity is the percentage of non-events (non-accidents) that were successfully predicted. An ROC curve plots Sensitivity against 1-Specificity (the false positive rate) across different cutoffs.
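As a worked illustration, take the counts in the classification table at the 0.50 cutoff, assuming they read 21 correctly predicted events, 11 correctly predicted non-events, 9 non-events incorrectly predicted as events, and 4 events incorrectly predicted as non-events (25 accidents and 20 non-accidents in total):

```latex
\[
\text{Sensitivity} = \frac{21}{21 + 4} = \frac{21}{25} = 84.0\%,
\qquad
\text{Specificity} = \frac{11}{11 + 9} = \frac{11}{20} = 55.0\%
\]
\[
\text{Correct} = \frac{21 + 11}{45} = \frac{32}{45} \approx 71.1\%,
\qquad
1 - \text{Specificity} = 45.0\%
\]
```

The pair (1 - Specificity, Sensitivity) = (0.45, 0.84) is the point this cutoff contributes to the ROC curve.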

SAS does the ROC curve for you easily with ODS graphics and the "OUTROC=ROC" option in the MODEL statement. Note that in all ROC curve outputs of this type, SAS will tell you the area under the ROC curve. We use this measurement to judge how good a fit this model is to the data, by determining how well the fitted model makes predictions. An area under the curve of 0.50 corresponds to guessing at random, so we want an area well above 0.50; the closer to 1, the better. ROC curves are not limited to logistic regression and aren't always used in the analysis, but they are an easy diagnostic to run in SAS.
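If ODS graphics is not available, the curve can also be drawn directly from the OUTROC= data set. The sketch below assumes the data set is named ROC as in the final model code, and relies on the _SENSIT_ and _1MSPEC_ variables that PROC LOGISTIC writes to an OUTROC= data set; the SYMBOL settings and title are illustrative choices.

```sas
* Sketch: plot the ROC curve from the OUTROC= data set (named ROC above);
SYMBOL1 INTERPOL=JOIN VALUE=DOT;
PROC GPLOT DATA=ROC;
   TITLE "ROC Curve";
   PLOT _SENSIT_*_1MSPEC_;   /* sensitivity versus 1 - specificity */
RUN;
QUIT;
```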