Professional Documents
Culture Documents
Loqistic Regression
o Extends idea oflinear regression to situation where outcome variable is categorical
Widely used, particularly $,here a strurtured model is useful to explain (=profiling) or to predict
The Logit
Goal: Find a function ofthe predictor variables that relates them to a 0/l outcome
o Instead o1')'as outcome variable (like in linear regression), we use a function ofY called
the logit
o Logit can be modeled as a linear fu:nction of the predictors
o The logit can be mapped back to a probability" which, in tum, can be mapped to a class
a Standard linear fi.rnction (as shown below) does not guarantees 0 < p < l:
The Fix:
we logislic response Junction
7
t |+ r+0zxz- -.Aqxe) prrL,"Lil:11
".-(lo+0i
*,^
1
h': a taiv (rt\
!, I
J
tt'f
gIil)'o'L r 11 )=
Steo 2: The Odds o.I '-t
The odds olan event are defined as: ol,d, {q Htil -r'I t
p
Odds
eq. 10.3 l-p cl oAh r,*ic
Or, given the odds ofan event, the probability ofthe event can be computed by:
r] "Fo+
Step 3: Take los on both sides
This gir,es us the logit:
log(Odds) = 0o + B,x, - B,x., +...+ pqx(1
l . a aD-.
--r--
aalr.t-a
(ttlw :
S -sla
t'
T lon',,|., {"t"liu
ol \
o2
't' ,t{^
ollcr {u nclms can be uu/ ,
x cDt
r, dot;i1 .fucf;,a )
f cunnlalx
-\L U:f)' irrl
\g
{(r)' {ir fln"n* ="'o
0 ---i-
,10f th
Odds (a) and Logit (b) as function ofP
1q)
g)
&
m
g 00
o
*
ao
30
a0
10
0
0 0,1 0.2 0.3 0.4 05 0.6 0.7 0.8 0.9
kobablllt o, sllcc.!.
(r)
"3
Prob*illty ot Succaai
..___...1r)
Example
Personal Loan Offer (UniversalBank.csr')
5
Sinele Model
Data Prep:
#### Table l0 .2
bank. df <- read. csv ( "UniversalBank. csv " )
head (-(bank. df )
> Gad ( bank. df )
tD Age Experience Incone ZI?.Code Family CCAvg Educatior- Morlgage PelsonaI.Loan Securities.Accolrnt CD.AccounE Online
r66itc.rd
125 1 49 91107 4 7-6 1 c 0 0
3 39 15 11 94120 1 1.0 I 3 0 0 0 c
5 35 I 45 91330 4 1.0 2 ! 0 0 0 3
# partj-tion data
set. seed (2 ) # random sample can be reprociuced by settj-ng a value for seed
train. index <- sample (c (1 : din (bank.df ) [ 1] ), dim(bank. df ) t1l *0. 6) [o/, *a
train.df <- bank.df ltrain. index, ] nunhTv o1' Ydwt is h
valid.df <- bank.df [-train. index,
t ]
r!.. tYilrr; r^ 4
:
t n'r', Ss b :t
{ Y- I l
J,u,at r6taavtt r rl: hv',*t r
Seeing the Relationship
09
0.8
a7
J 0ai
0.5
o
04
03
02
01
0
C 50 tl:,o 150 200 250
lncom. {in 3000s}
Estimation: Estimates of b's are derived through an iterative process called marimum likelihood
estimalion
5
> # run foqistic regression
> * use glmo (general linear mociel) with family = "binomial" to fit a logistic
> * reqression. a.g[ xrr,,nl r,'in.r,
I ^u,,^lrkt
iu ( > logit.reg <- glm (Persona.l . Loan - ., data - train.df, family = "blnomial")
> opti-ons (scipe;-999) 4 t,lutn ,{t e noffin^ l
o\ it't t ,l^,,o
> sumnary (logit. reg)
*w= a
Y= 0
CaIl:
g1m(formula : Personal.Loan - famjly - "binomiaI", data = tlain.Cf)
Deviance ResiduaLs:
Min 1Q Median 3Q Max
-2.0380 -0.184? -0.0621 -0.0183 3.9810
Coeffic.ients: P4
Est imate sr d. Error z vaLue Pr(>lzl)
(Intercept) -72 .6845628 2 .2903310 -s. s37 0.0000000308
Age -0.0369346 0 .0848937 -0.435 0.66351
Experience 0 .04 90 64 5 0 .0844410 0. 581 0.56L2t
Incone 0 .0612953 0 .4o391 62 15.416 < 0.0000000000000002
t anrly 0 .5 434651 0 .a994936 3. az
t o. ooooooo4To
CCAvg 0 .2765942 0 .0601900 3. s99 0.00032
EducationGraduate .2681068 0 .370337t u. 525 < 0.0000000000000002
EducatioEdvanced/
.-- Profes s ional ; .4408154 0 .3123360 \.921 < 0 . 000 00 00 000 00002
0
Mortgage 0 .0c154 99 0 . o00'1 926 1.955 0.05052
Securities . Account -1 .L451416 0 .3955796 -2-896 0, 00377
CD. Account 4 .5855656 0 ,4'71'7 69e, 9.598 < 0. 0000000000000002
onllfi6 -0 .8588074 0 .279t211 -3.919 0.0000888005
C!editCard -1 .25t42t3 0 .2944'7 6'1 -4 -250 0 . 00 00214111
6
Evaluatine Classification Performance
Performance measures: Confusion matrix and 7o of misclassifications
# some libraries
insta1l. packages ("e107 L " )
Iibrary(e107L) * misc functions in stat, prob
install . packages ("Rcpp" )
library(Rcpp) # R and c++ integration
install. packages ( "caret" )
Iibrary(caret) # Classfication And REgression Training
# evaluate
+ predict (logit.reg, valid.df) assumes a Iinear regression prediction
# adding type = "response" means 1og (odds) is linear regression, hence
# the prob follows a logistic regression model
ttJc lcliskc ulxt"ta
p,j$l-.-_ntSg-r"t (Iogit.reg, valid.df , type = "response") to p,!l:r+i !^ w"N-ff A
* the same as
# predl <- predict (Iogit,reg, va.id.df[,-8], type : "response" l 'l^lt"
t
# evaluate r ctq,., n 8 r, ?eq^l ,L&,a
> # get the confusion Matrix usinl table tY)
> c.mat .- !3!] (ifeJ-se (pred > 0.5, 1, 0), valid.df[,8] )
> c.mat "
hclnnt i+ y{\&rl proL > o-5 o alodtl l't
01 ol\crntr. 't ott
0 L779 58
r d"kl 1 30 133
7
+ROC Curve and other capabiLitres of "Informationvalue"
instal.L packaqes (" InformationValue " )
Iibrary ( I n fo rmationVal-ue )
plotROC (vaIid.df [. 8] . pred)
t dt((w,"t 'ut'fi
t,
viohil"l+'t ?
0 7a
chtJiu'*;e;
u
o-
t
T., [. E'*
a
c
a0) 0
0 2i
Brl
$optimalCutoff
[1] 0 . {7 99{8{
0. 9823107
0.0435
0.0430
0.043s
a towrrF
55 0.459948397 o - 01824212 o."t!.12't'15 0.699035364 0.9817579 0. 04 35
ttn\e -clqil,1\
56 ,-,* 1.,
output skipped
9s 0.059943397 0.129906 0.9 c05236 0. ?7 0 617 535 0.8700940 0.12?0
w u\bt
96 0.049943391 0.1448314 0.9 162304 0.771398958 0.8551686 0.1390
97 0.03994939? 0- 1558375 0. 92i455 0.r55628489 0.834162s 0.1575 ^ccuydrl
98 0.029948397 0.2012161 0 2 6't 0t6 a.125485429 0.7987839 0.1890
99 0.019948397 0.2399116 0 424084 a,-t02496824 0.7600884 0.222s
100 0.009943397 0.3228303 0 .9685864 0.645756094 o.6'7'7169'1 0.2950
$TPR
ltl 0.172a419 1t /o
$ FPR
[1] 0.017136s4 tt/"
$ Spec j- f ic j- ty
+ The maximization criterion for which probability cutoff score needs tc be optimised.
Can take either of following values: "Ones" or "Zeros" or "Both,'or
>* "misclasselror" (default). If "O!e9" is used, 'optimalcutoff' will be chosen to
maximise detection of "One's". If 'Bcth' is specified, the probabilj.ty cut-off
1t1,
> + that gives maximum Youden's Index is chosen. If'misclasserror' is specified,
> * the prooability cut-off that gives roinimum mis-claaificatioo error is chosen.
9
Convertine to Probabilifv
Odds
p
l+ Odds
10
lntemretinq Odds. Probabilitv
For predictive classification, we typicalll,use probability rvith a cutoffvalue
The "Iift" over the base curve indicates for a given number of cases (read on the x-axis), the
additional responders that you can identifu by using the model.
The same information is portra)'ed in in Decile-wise lift chart: Taking the ro% of the records
that are ranked by the model as "most probable t's" yields 7.9 times as many 1's as rvould simply
selecting ro* ofthe records at random.
11
Lift chart (validation dataset)
250
200
ailt! Y'fnar<
0) Personal Loan
-Cumulative
.I when sorted
vtI.itt<
G 150
I f ''fu
rf )to untf{ using predicted
E 100 values
J
o 50 t Personal Loan
-Cumulative
0 using average
0 1000 2000 3000
# cases
E 10
.o
-9
o
I
6
EE 4
Etr
-9 2
o
o
o 0
1234 56 78910
Deciles
72
##*# Eiqure 10.3
> lnstal1 . packages ( "gains " ) * to get lift chart
library(gains) y iv fu
f
galn <- gains lvalid. df$ Personal.l,oan, logit.reg.pred,
V^1.,1.fr^
groups-10)
c.0. . gain. cune. pct. of . Lotal. . . suni. va1id. Cf, Persona L , Loan. . c.0..gain.cu.ne.obs.
1 0 0
2 151 200
3 112 400
A 180 600
5 185 800
6 18? 1000
1 187 7240
I 189 1400
190 1600
10 191 1800
11 191 20c0
Y -a*'., X - it\\J
data.frame( sum(vafid.df$Personal.Loan) ) , c(0, dim(valid.df) t1l )
c (0, )
drlt4
13
V,rl l rt-;tli r. rn'ry lt
pl;riil f uixtr ltu wr*'t
i ca'tcd
Llltat
n \
I T
o
rJ) mrirrg Y'lo'^^n''
1t*'l
{,)
'r: o
ood- tl )tao tit,
!o
E I. hYtnn'lc
(no wLl) 'i'l.t. tt\t- lA,
O te
fr,fYnran " lt rN<
o
lr,
# cases
> gainsmean. re sp
t1l 0.755 0.105 0.010 0.125 0.010 0.000 0.c10 0.00s 0.005 0.000
> gainSmean. resp* 2 0 0
[1] 151 27 B 5 2 O 2 7 1 C
> mearr (va 1id. df S Persona I . Loan )
[1] 0.09ss
> height s
[1]
7.90575916 r.0994'1644 0.41881111.7 0.261r8010 0.104112A4 0.00000000 0.10471204
0.05235602 0.45235602 0.0C000000
1,4
midpolnts <- barplot (heig:rts, nanes.arg = gain$depth, ylim = c(0,9).
xlab = "Percentilerr, ylab = "Nlean Response", main: "Dec.i.Le-wise lift chart")
79
@
(o
o
C
o
o-
c,
t
ES
C'
c)
a'l
11
0.4
u'r o 1 n o.l 0.1 0.1 o
o ffim B:Ei!, Er,*E @
10 20 30 40 50 - 60 7A 80 90 100
Percentile
In gains package:
Actual: a numeric vector ofactual response values
Predicted: a numeric vector of predicted response values. This vector must have the same length
as actual, and the ith value of this vector needs to be the model score for the subject with the ith
value ofthe actual vector as its actual response.
Groups: an integer containing the number of rows in the gains table. The default value is 10.
15
Multicollinearity
Solution: Remove extreme redundancies (by dropping predictors via variable selection. or by
data reduction methods such as PCA)
Variable Selection
This is the same issue as in linear regression
I . The number of correlated predictors can grow when we create derived variables such as
interaction terms (e.9. Income x Famib), to captvre more complex relationships
2. Problem: Overly complex models have the danger of overfitting
3. Solution: Reduce variables via automaled selection of variable subsets (as with linear
regression)
4. SeeChapter6 tibverl I l4fttt>
claoAaC
I
Summan'
I . Logistic regression is similar to linear regression, except that it is used with a categorical
response
16
Problems
l. Financial Condition of Banks. The fie Banks.c.sv includes data on a sample of
20 banks. The "Financial Condition" column records thejudgment ofan expen on the
financial condition ofeach bank. This outcome variable takes one of two possible
values-ryeak or strong-according 10 the financial condition of the bank. The predictors
are two ratios used in the financial analysis ofbanks: Totlns&Lses/Assets is the ratio of
total loans and leases to total assets and TotExp/Assets is the ratio oftotal expenses to
total assets. The target is to use the trvo ratios for classifuing the financial condition ofa
new bank.
Run a logistic regression model (on the entire dataset) that models the status ofa bank as
a function ofthe two financial measures provided. Speciff the saccess class as wealr (this
is similar to creating a dummy that is I for tinancially weak banks and 0 otherwise), and
use the default cutoff value of 0.5.
a. Write the estimated equation that associates t}re financial condition of a banl< with
its t\ro predictors in three formats:
i. The logit as a function ofthe predictors
ii. The odds as a function ofthe predictors
iii. The probability as a tirnction ofthe predictors
b. Consider a new bank whose total loans and leases/assets ratio = 0.6 and
total expenses/assets ratio :0.11. From your logistic regression model, estimate
the lbllowing four quantities for this bank (use R to do all the intermediate
calculations; show your final answers to four decimal places): the logit, the odds,
the probability ofbeing financially weak, and the classification ofthe bank (use
cutoff = 0.5).
c. The cutoff value o10.5 is used in conjunction with the probability ofbeing
financially weak. Compute the threshold that should be used if u,e want to make a
classification based on the odds ofbeing financially weak, and the threshold for
the corresponding logit.
d. Interpret the estimated coefllcient for the total loans & leases to total
assets ratio (Totlns&LsesiAssets) in terms ofthe odds ofbeing financially weak.
e. When a bank that is in poor financial condition is misclassified as
financially strong. the misclassitication cost is much higher than when a
financially strong bank is misclassified as weak. To minimize the expected cost of
misolassification, should the cutof-f value for classification (which is currently at
0.5) be increased or decreased?
2. ldentifying Good System Administrators. A management consultant is studying
the roles played by experience and training in a system administrator's ability to
complete a set oftasks in a specified amount cr1'time. In particular, she is interested in
discriminating between administrators who are able to complete given tasks within a
specified time and those who are not. Data are collected on the performance of 75
randomly selected administrators. 1'hey are stored in lhe {rle SystemAdministators.csv.
1,7
The variable Experience measures months of lull-1ime system administrator experience,
while Training measures the number of relevant training credits. The outcome variable
Completed is either les or -Nb, according to whether or not the administrator completed
the tasks.
a. Create a scatter plot of Experience vs. Training using color or symbol to
distinguish prograrnmers u'ho completed the task from those who did not
complete it. Which predictor(s) appea(s) potentially useful for classiflying task
completion?
b. Run a logistic regression model with both predictors using the entire dataset as
training data. Among those who completed the task, what is the percentage of
programmers incorrectly classified as failing to complete the task?
c. To decrease the percentage in part (b), should the cutoff probability be increased
or decreased?
d. How much experience must be accumulated by a programmer with 4 years of
training before his or her estimated probability of completing the task exceeds
0.5?
3. Sales of Riding Mowers. A company that manufactures riding mowers wants to
identify the best sales prospects lbr an intensive sales campaign. In particular. the
manufacturer is interested in classili'ing households as prospective owners or nonowners
on the basis of Income (in $1000s) and Lot Size (in 1000 ft'). The marketing expert
looked at a rardom sample of 24 households, given in the flJ.e RidingMower.r.csv. LIse a[[
the data to fit a logistic rcgression o1'ownership on the two predictors.
a. Wtat percentage of households in the study were owners of a riding mou,er?
b. Create a scatter plot of Income vs. Lot Size using color or symbol to distinguish
orvners from nonowners. From lhe scatter plot, which class seems to have a
higher average income, ou'nors or nonowners?
c. Among nonowners, what is the percenlage of households classified correctly?
d. l'o increase the percentage ofcorrectly classified nonowners, should the cutoff
probability be increased or decreased?
e. What are the odds that a household with a $60K income and a lot size of 20,000
ft'is an ou.ner?
f. What is the classification of a household with a $60K income and a lot size of
20,000 ft?? Use cutoff :0.5.
g. What is the minimum income that a household with 16,000 ft? lot size should
have before it is classified as an owner?
18