
Developing the Decision Model through Logistic Regression

As part of Weekly Assignments

To be Submitted to:
PROF. SREEDHARA R.

Presented by:
Anindya Biswas
1527605, M1
QUESTION:
A pharmaceutical firm that developed a particular drug for women wants to understand the characteristics that cause some of them to have an adverse reaction to it. They collect data on 15 women who had such a reaction and 15 who did not. The variables measured are:

1. Systolic Blood Pressure
2. Cholesterol Level
3. Age of the person
4. Whether or not the woman was pregnant (1 = No, 2 = Yes)

The dependent variable indicates whether there was an adverse reaction (1 = Yes; 0 = No).
BP    Cholesterol   Age   Pregnant   DrugReaction
100   150           20    1          0
120   160           16    1          0
110   150           18    1          0
100   175           25    1          0
 95   250           36    1          0
110   200           56    1          0
120   180           59    1          0
150   175           45    1          0
160   185           40    1          0
125   195           20    2          0
135   190           18    2          0
165   200           25    2          0
145   175           30    2          0
120   180           28    2          0
100   180           21    2          0
100   160           19    2          1
 95   250           18    2          1
120   200           30    2          1
125   240           29    2          1
130   172           30    2          1
120   130           35    2          1
120   140           38    2          1
125   160           32    2          1
115   185           40    2          1
150   195           65    1          1
130   175           72    1          1
170   200           56    1          1
145   210           58    1          1
180   200           81    1          1
140   190           73    1          1

Results & Analysis


Classification Table (a)

                                              Predicted
                                        Drug Reaction           Percentage
Observed                              No Reaction   Reaction    Correct
Step 1   Drug Reaction   No Reaction          11          4       73.3
                         Reaction              2         13       86.7
         Overall Percentage                                       80.0

a. The cut value is .500

Classification Table:
The classification table displays the overall prediction accuracy of the model. Its elements are:
Observed: This indicates the number of 0's and 1's observed in the dependent variable (which in our case is Drug Reaction).
Predicted: These are the predicted values of the dependent variable based on the full logistic regression model. The table shows how many cases are correctly predicted and how many are not.
In our model, of the 15 women with no reaction, the model correctly identifies 11 as not likely to have one. Similarly, of the 15 who did have a reaction, the model correctly identifies 13 as likely to have one.
Overall Percentage: This gives the overall percentage of cases correctly predicted by the model. The accuracy has increased from 50% in the null (Block 0) model to 80% overall in the Block 1 classification table.
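As an illustrative sketch, and assuming the fitted model, X and y from the earlier Python snippet, the classification table and overall accuracy at the .500 cut value could be reproduced as follows:

import pandas as pd

pred_prob = model.predict(X)                  # predicted probability of a reaction
pred_group = (pred_prob >= 0.5).astype(int)   # apply the .500 cut value

# Cross-tabulate observed vs. predicted outcomes (the classification table)
print(pd.crosstab(y, pred_group, rownames=["Observed"], colnames=["Predicted"]))

# Overall percentage of correctly classified cases
print(f"Overall percentage correct: {(pred_group == y).mean():.1%}")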

Hosmer and Lemeshow Test

Step   Chi-square   df   Sig.
1      4.412         8   .818

Hosmer and Lemeshow Test:


The Hosmer and Lemeshow test divides subjects into 10 ordered groups and then compares the number actually observed in each group (observed) with the number predicted by the logistic regression model (predicted).
The 10 ordered groups are created based on the estimated probability: those with an estimated probability below .1 form one group, and so on, up to those with a probability of .9 to 1.0.
If the significance of the Hosmer and Lemeshow statistic is greater than .05, as is the rule of thumb for well-fitting models, we fail to reject the null hypothesis that there is no significant difference between the observed and predicted values.
In our case, the significance of the Hosmer and Lemeshow test is .818, which is greater than .05. The test is therefore not statistically significant, and our model is quite a good fit.
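The grouping idea can be sketched in a few lines of Python. This is only an approximation of the SPSS procedure (it uses ten equal-sized groups sorted by predicted probability, and very small expected counts can make the statistic unstable), so the exact figures may differ slightly:

import numpy as np
from scipy.stats import chi2

def hosmer_lemeshow(y_obs, p_hat, groups=10):
    # Sort cases by predicted probability and split them into ordered groups
    order = np.argsort(p_hat)
    stat = 0.0
    for idx in np.array_split(order, groups):
        n = len(idx)
        obs_events = y_obs[idx].sum()   # observed reactions in the group
        exp_events = p_hat[idx].sum()   # expected reactions in the group
        # Chi-square contributions from events and from non-events
        stat += (obs_events - exp_events) ** 2 / exp_events
        stat += ((n - obs_events) - (n - exp_events)) ** 2 / (n - exp_events)
    df = groups - 2
    return stat, chi2.sf(stat, df)      # statistic and its significance (p-value)

# Example, using arrays from the earlier snippets:
# hosmer_lemeshow(y.to_numpy(), pred_prob.to_numpy())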

Model Summary

Step   -2 Log likelihood   Cox & Snell R Square   Nagelkerke R Square
1      21.841a             .482                   .643

a. Estimation terminated at iteration number 7 because parameter estimates changed by less than .001.

Model Summary:
Cox and Snell's R-Square: It attempts to imitate the multiple R-Square based on the likelihood, but its maximum can be (and usually is) less than 1.0, making it difficult to interpret. In our case, the Cox & Snell R Square is .482, which means that 48.2% of the variation in the dependent variable (DrugReaction) can be explained by our predictor variables.
The Nagelkerke modification, which does range from 0 to 1, is a more reliable measure of the relationship. Nagelkerke's R Square will normally be higher than the Cox and Snell measure; it is part of the SPSS output in the Model Summary table and is the most commonly reported of the R-squared estimates.
In our case, it is indeed higher (.643) than the Cox & Snell R Square, meaning that a larger share of the variance in the dependent variable is explained by the predictor (independent) variables.
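These two pseudo R-squared values can be checked by hand from the log-likelihoods. The short calculation below assumes the null-model log-likelihood equals 30·ln(0.5), which holds here because the sample is split 15/15 on the dependent variable:

import numpy as np

n = 30
ll_model = -21.841 / 2       # from the -2 Log likelihood in the Model Summary
ll_null = n * np.log(0.5)    # null model: both outcomes equally likely

cox_snell = 1 - np.exp((2 / n) * (ll_null - ll_model))
nagelkerke = cox_snell / (1 - np.exp((2 / n) * ll_null))
print(round(cox_snell, 3), round(nagelkerke, 3))   # approximately 0.482 and 0.643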

Variables in the Equation

                            B        S.E.     Wald    df   Sig.     Exp(B)
Step 1a   BP              -.018     .027      .463    1    .496       .982
          Cholesterol      .027     .025     1.182    1    .277      1.027
          Age              .265     .114     5.404    1    .020      1.304
          Pregnant        8.501    3.884     4.790    1    .029   4918.147
          Constant      -26.375   13.680     3.717    1    .054       .000

a. Variable(s) entered on step 1: BP, Cholesterol, Age, Pregnant.

Variables in the Equation:

Since BP and Cholesterol show up as not significant, one can try running the regression again without those variables to see how that affects the prediction accuracy. However, since the sample size is small, one cannot conclude that they are truly insignificant; Wald's test is best suited to large sample sizes.
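As a quick interpretation aid, Exp(B) is simply e raised to the coefficient, i.e. the factor by which the odds of a reaction change for a one-unit increase in that predictor. Taking Age as a worked example:

$$\mathrm{Exp}(B_{\mathrm{Age}}) = e^{0.265} \approx 1.304$$

so each additional year of age multiplies the odds of an adverse reaction by roughly 1.3, holding the other variables constant.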

The prediction equation is:

$$\log(\text{odds of reaction to the drug}) = -26.375 - 0.018\,(\mathrm{BP}) + 0.027\,(\mathrm{Cholesterol}) + 0.265\,(\mathrm{Age}) + 8.501\,(\mathrm{Pregnant})$$
As with any regression, positive coefficients indicate a positive relationship with the dependent variable. Here we can say that with age women develop sensitivities to certain drugs, which might be why Age has a positive effect on the odds of a reaction to the drug. Similarly, a pregnant woman might be allergic to many drugs because the infant she is carrying might react against them, which possibly explains the positive impact of the pregnancy factor in the equation.
We shall now calculate the predicted probability by substituting the values of BP, Cholesterol, Age and Pregnant into the above equation. We shall also calculate the predicted group with the cut-off at .500 and check the accuracy for ourselves, as shown below:
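The following is a minimal sketch of that calculation in Python (the coefficients come from the Variables in the Equation table; the function name predict_reaction is ours):

import numpy as np

def predict_reaction(bp, cholesterol, age, pregnant, cutoff=0.5):
    # Linear predictor from the fitted equation
    logit = -26.375 - 0.018 * bp + 0.027 * cholesterol + 0.265 * age + 8.501 * pregnant
    prob = 1 / (1 + np.exp(-logit))     # logistic function gives the predicted probability
    return prob, int(prob >= cutoff)    # predicted group at the .500 cut value

# Example: the first woman in the table (BP 100, Cholesterol 150, Age 20, Pregnant = 1)
print(predict_reaction(100, 150, 20, 1))   # roughly (0.00003, 0), matching the table below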
BP    Cholesterol   Age   Pregnant   Drug Reaction   Predicted Probability   Predicted Group
100   150           20    1          0               0.00003                 0
120   160           16    1          0               0.00001                 0
110   150           18    1          0               0.00002                 0
100   175           25    1          0               0.00023                 0
 95   250           36    1          0               0.03352                 0
110   200           56    1          0               0.58319                 1
120   180           59    1          0               0.60219                 1
150   175           45    1          0               0.01829                 0
160   185           40    1          0               0.00535                 0
125   195           20    2          0               0.24475                 0
135   190           18    2          0               0.12197                 0
165   200           25    2          0               0.40238                 0
145   175           30    2          0               0.65193                 1
120   180           28    2          0               0.66520                 1
100   180           21    2          0               0.30860                 0
100   160           19    2          1               0.13323                 0
 95   250           18    2          1               0.58936                 1
120   200           30    2          1               0.85228                 1
125   240           29    2          1               0.92175                 1
130   172           30    2          1               0.69443                 1
120   130           35    2          1               0.76972                 1
120   140           38    2          1               0.90642                 1
125   160           32    2          1               0.75435                 1
115   185           40    2          1               0.98365                 1
150   195           65    1          1               0.86545                 1
130   175           72    1          1               0.97205                 1
170   200           56    1          1               0.31892                 0
145   210           58    1          1               0.62148                 1
180   200           81    1          1               0.99665                 1
140   190           73    1          1               0.98260                 1

As we can see, with the cutoff at .500, cases whose predicted probability lies below .500 are classified as 0, i.e. those who do not have a reaction to the drug, whereas cases whose predicted probability is above .500 are classified as 1, i.e. those expected to have an adverse reaction to the drug. This can be represented graphically as shown below:

[Figure: predicted probabilities on a 0.000 to 1.000 scale with a cutoff at 0.500; cases below the cutoff are labelled "No Reaction to Drug" and cases above it "Adverse Reaction to Drug".]

Thus, from the above table we can see that six cases have been incorrectly predicted, bringing the number of correct predictions down to 24 out of 30, or 80%, as shown in the classification table earlier.

FINAL VERDICT:
The model has considerable discriminating power (80% correct classification); however, it treats factors such as Systolic Blood Pressure and the respondent's Cholesterol level as insignificant, which is certainly not the case in real-life scenarios, where these are often the deciding factors in whether a particular drug can be administered to a patient.
However, this is mainly because the sample size is too small; if the sample size is increased, we believe that eventually even these factors might prove to be significant in the research process.
