
Sardar Patel Institute of Technology, Mumbai

Department of Electronics and Telecommunication Engineering

T.E. EXTC (2018-2019)

ETL54-Statistical Computational Laboratory


Lab-4: Regression Analysis: Logistic & Multinomial Logistic Regression

NAME: Vishal Ramina | Batch: D | UID: 2017120049

Objective:

(i) To implement Logistic Regression for University Admission Dataset


(ii) To implement Multinomial Logistic Regression for Iris Dataset

Outcomes:
• To carry out logistic regression
• To analyze various parameters like Accuracy, Misclassification Rate, True Positive Rate, False Positive Rate, True Negative Rate, Precision, Prevalence, Null Error Rate, Cohen's Kappa, F Score, etc.
• To understand how to predict probabilities in a multinomial logistic regression model.

Introduction to Logistic Regression

Logistic regression is a predictive modeling algorithm that is used when the Y variable is binary categorical, that is, when it can take only two values such as 1 or 0. The goal is to determine a mathematical equation that can be used to predict the probability of event 1. Once the equation is established, it can be used to predict Y when only the X's are known. Logistic regression can therefore be used to model and solve binary classification problems such as the admission-prediction task in this lab.
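The equation in question is the logistic (sigmoid) function, which maps a linear combination of the predictors onto a probability between 0 and 1:

\[
P(Y = 1 \mid X) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p)}}
\]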

Limitations of Linear Regression

When the response variable has only 2 possible values, it is desirable to have a model that predicts the value either as 0 or 1, or as a probability score ranging between 0 and 1. Linear regression does not have this capability: if you use linear regression to model a binary response variable, the resulting model may not restrict the predicted Y values to the interval [0, 1].
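As a quick illustration (a minimal sketch on simulated data, not the admission dataset), a linear model fit to a 0/1 response can produce predictions below 0 and above 1:

> set.seed(1)
> x<-rnorm(100)
> y<-rbinom(100,1,plogis(2*x))                  # simulated binary response
> linmod<-lm(y~x)                               # linear regression on a 0/1 outcome
> range(predict(linmod,data.frame(x=c(-3,3))))  # predictions typically fall below 0 and above 1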

Building a Logistic Regression Model in R

Now let's see how to implement logistic regression using the given University Admission dataset in csv format. The goal here is to model and predict whether a given application (a row in the dataset) is admitted or rejected, based on 3 other features. So, let's load the data and keep only the complete cases.

> mydata<-read.csv("~/Downloads/binary.csv",header=T)
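The complete-cases step is not shown in the transcript; a line like the following (an assumed step) would drop any rows containing missing values:

> mydata<-mydata[complete.cases(mydata),]   # keep only rows with no NA values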

Structure of this dataset:

> str(mydata)
'data.frame': 400 obs. of 4 variables:
$ admit: int 0 1 1 1 0 1 1 0 1 0 ...
$ gre : int 380 660 800 640 520 760 560 400 540 700 ...
$ gpa : num 3.61 3.67 4 3.19 2.93 3 2.98 3.08 3.39 ...
$ rank : int 3 3 1 4 4 2 1 2 3 2 ...

The dataset has 400 observations and 4 columns. The admit column is the response (dependent) variable, and it tells whether a given student is accepted or rejected. The rank and admit columns are numeric (int) data types; let us convert them into factors.

> mydata$admit=as.factor(mydata$admit)
> mydata$rank=as.factor(mydata$rank)
> str(mydata)
'data.frame': 400 obs. of 4 variables:

$ admit: Factor w/ 2 levels "0","1": 1 2 2 2 1 2 2 1 2 1 ...


$ gre : int 380 660 800 640 520 760 560 400 540 700 ...
$ gpa : num 3.61 3.67 4 3.19 2.93 3 2.98 3.08 3.39 3.92 ...
$ rank : Factor w/ 4 levels "1","2","3","4": 3 3 1 4 4 2 1 2 3 2 ...

> xtabs(~admit+rank,data=mydata)
rank
admit 1 2 3 4
0 28 97 93 55
1 33 54 28 12

Partitioning the dataset

We partition the given dataset into training and test data. Before building the logistic
regressor, you need to randomly split the data into training and test samples.

> set.seed(123)
> ind<-sample(2,nrow(mydata),replace=T,prob=c(0.8,0.2))
> ind
(output omitted: a vector of 400 values, each 1 or 2, assigning every row to the training (1) or test (2) sample)

> traindata<-mydata[ind==1,]
> testdata<-mydata[ind==2,]
> str(traindata)

'data.frame': 325 obs. of 4 variables:


$ admit: Factor w/ 2 levels "0","1": 1 2 2 2 2 2 1 1 2 1 ...
$ gre : int 380 660 800 760 560 540 700 440 760 700 ...
$ gpa : num 3.61 3.67 4 3 2.98 3.39 3.92 3.22 4 3.08 ...
$ rank : Factor w/ 4 levels "1","2","3","4": 3 3 1 2 1 3 2 1 1 2 ...

> str(testdata)
'data.frame': 75 obs. of 4 variables:
$ admit: Factor w/ 2 levels "0","1": 2 1 1 1 1 2 1 1 1 1 ...
$ gre : int 640 520 400 800 480 540 500 680 540 760 ...
$ gpa : num 3.19 2.93 3.08 4 3.44 3.81 3.17 3.19 3.78 3.35 ...
$ rank : Factor w/ 4 levels "1","2","3","4": 4 4 2 4 3 1 3 4 4 3 ...

Generating the Model

The syntax to build a logit model is very similar to the lm function you saw in linear
regression. You only need to set the family='binomial' for glm to build a logistic
regression model.

glm stands for generalised linear models and it is capable of building many types of
regression models besides linear and logistic regression.

> model<-glm(formula=admit~gre+gpa+rank,data=traindata,family=binomial)

In the above model, admit is modeled as a function of gre, gpa and rank.

Summarizing the model

> summary(model)

Call:
glm(formula = admit ~ gre + gpa + rank, family = binomial, data = traindata)

Deviance Residuals:
    Min      1Q  Median      3Q     Max
-1.6388 -0.8867 -0.6350  1.1109  2.0515

Coefficients:

Estimate Std. Error z value Pr(>|z|)


(Intercept) -4.382508 1.258545 -3.482 0.000497 ***
gre 0.001853 0.001213 1.527 0.126687
gpa 1.031450 0.370665 2.783 0.005391 **
rank2 -0.807749 0.359133 -2.249 0.024502 *
rank3 -1.319560 0.385300 -3.425 0.000615 ***
rank4 -1.523491 0.458172 -3.325 0.000884 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 414.65 on 324 degrees of freedom


Residual deviance: 377.36 on 319 degrees of freedom
AIC: 389.36

Number of Fisher Scoring iterations: 4

(From the summary, gre is the predictor that is not significant, since its p-value (0.127) is greater than 0.05; gpa (p ≈ 0.005) and the rank dummies are significant. The refit below nevertheless drops gpa and keeps gre:)

> model<-glm(formula=admit~gre+rank,data=traindata,family=binomial)
> summary(model)

Call:
glm(formula = admit ~ gre + rank, family = binomial, data = traindata)
Deviance Residuals:
    Min      1Q  Median      3Q     Max
-1.6054 -0.8949 -0.6930  1.1789  2.0388

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.596438 0.731463 -2.183 0.029071 *
gre 0.003203 0.001114 2.876 0.004023 **
rank2 -0.905751 0.352906 -2.567 0.010272 *
rank3 -1.333377 0.379481 -3.514 0.000442 ***
rank4 -1.629553 0.452996 -3.597 0.000322 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)
Null deviance: 414.65 on 324 degrees of freedom
Residual deviance: 385.41 on 320 degrees of freedom
AIC: 395.41
Number of Fisher Scoring iterations: 4
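Because logistic regression coefficients are on the log-odds scale, exponentiating them turns them into odds ratios, which are easier to interpret; a short sketch:

> exp(coef(model))   # odds ratios: exp(0.003203) ≈ 1.0032, exp(-0.905751) ≈ 0.40, ...

So each additional GRE point multiplies the odds of admission by roughly 1.003, and rank-2 applicants have about 0.40 times the odds of rank-1 applicants, holding gre fixed.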

Utilizing Test Dataset

The model is now built. You can now use it to predict the response on test
data. You need to set type='response' in order to compute the prediction
probabilities.

> pred<-predict(model,testdata,type="response")
> head(pred)
4 5 8 11 16 20
0.2357941 0.1736073 0.2277837 0.3399874 0.1990518 0.5332868

The function used here is the logistic (sigmoid) function, so the predictions are probabilities. The common practice is to take the probability cutoff as 0.5: if the predicted probability of Y is greater than 0.5, it is classified as an event.
So if the prediction is greater than 0.5, the application is classified as admitted; otherwise it is rejected.
> p1<-ifelse(pred>0.5,1,0)

> head(p1)

4 5 8 11 16 20

0 0 0 0 0 1

Generating the confusion matrix

> table(p1)

p1

0 1

63 12

> table(testdata$admit)

0 1

57 18

> table(predicted=p1,actual=testdata$admit)

actual

predicted  0  1
        0 51 12   (TN  FN)
        1  6  6   (FP  TP)

Evaluation of performance of a model (based on standard parameters)

(Values computed from the confusion matrix above, with TN = 51, FN = 12, FP = 6, TP = 6.)

Accuracy: 0.76

Misclassification Rate: 0.24

True Positive Rate (Sensitivity): 0.3333

False Positive Rate: 0.1053

True Negative Rate (Specificity): 0.8947

Precision: 0.5

Prevalence: 0.24

Null Error Rate: 0.24

Cohen's Kappa: 0.2574

F1 Score: 0.4
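A sketch of how these quantities may be computed in R from the confusion matrix:

> tab<-table(predicted=p1,actual=testdata$admit)
> TN<-tab["0","0"]; FN<-tab["0","1"]; FP<-tab["1","0"]; TP<-tab["1","1"]
> (TP+TN)/sum(tab)      # accuracy: 57/75 = 0.76
> TP/(TP+FN)            # true positive rate: 6/18 = 0.3333
> FP/(FP+TN)            # false positive rate: 6/57 = 0.1053
> TN/(TN+FP)            # true negative rate: 51/57 = 0.8947
> TP/(TP+FP)            # precision: 6/12 = 0.5
> 2*TP/(2*TP+FP+FN)     # F1 score: 12/30 = 0.4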

Multinomial Regression

Multinomial logistic regression is used to model nominal outcome variables, in which the log odds of the outcomes are modeled as a linear combination of the predictor variables.
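Concretely, if the outcome has K categories and one of them serves as the reference level (multinom uses the first factor level, here setosa), the model fits K − 1 log-odds equations of the form:

\[
\log\frac{P(Y = k)}{P(Y = \text{ref})} = \beta_{k0} + \beta_{k1} x_1 + \cdots + \beta_{kp} x_p
\]

This is why the coefficient table later in this section has one row each for versicolor and virginica, both relative to the setosa baseline.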

Description of the data

For our data analysis example, we will be using the built-in iris data set.

> str(iris)

'data.frame': 150 obs. of 5 variables:

$ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...

$ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...

$ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...

$ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...

$ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

The data set contains measurements on 150 flowers. The outcome variable is Species, a factor with three categories.

Below we use the multinom function from the nnet package to estimate a multinomial logistic regression model.

> install.packages("nnet")

> library(nnet)

Partitioning the dataset

We partition the given dataset into training and test data. Before building the logistic
regressor, you need to randomly split the data into training and test samples.

> data<-iris
> set.seed(1234)
> ind<-sample(2,nrow(data),replace=T,prob=c(0.8,0.2))
> traindata<-data[ind==1,]
> testdata<-data[ind==2,]

Training and Summarizing

The model is trained using multinom function from the nnet package.

> model<-multinom(formula=Species~Sepal.Length+Sepal.Width+Petal.Length+Petal.Width,data=traindata)
# weights: 18 (10 variable)
initial value 129.636250
iter 10 value 10.683012
iter 20 value 5.933903

iter 30 value 5.873500
iter 40 value 5.866866
iter 50 value 5.861992
iter 60 value 5.860395
iter 70 value 5.859634
iter 80 value 5.859340
iter 90 value 5.859208
iter 100 value 5.859118
final value 5.859118
stopped after 100 iterations
> summary(model)
Call:
multinom(formula = Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width, data = traindata)

Coefficients:
(Intercept) Sepal.Length Sepal.Width Petal.Length Petal.Width
versicolor 20.95306 -2.019734 -12.17769 10.47244 -2.626553
virginica -18.78599 -4.541125 -18.31369 19.53561 14.264642

Std. Errors:
(Intercept) Sepal.Length Sepal.Width Petal.Length Petal.Width
versicolor 41.61007 134.8689 191.1050 76.27252 17.95107
virginica 42.22799 134.8825 191.1707 76.48059 19.08432

Residual Deviance: 11.71824


AIC: 31.71824

Predict on Test Data

> pred<-predict(model,testdata,type="class")
> head(pred)
[1] setosa setosa setosa setosa setosa setosa
Levels: setosa versicolor virginica
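To inspect the fitted class probabilities rather than hard class labels (the second outcome listed in the objectives), predict.multinom also accepts type="probs":

> head(predict(model,testdata,type="probs"))   # one probability column per species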

Confusion Matrix

> tab<-table(pred,testdata$Species)
> tab
pred         setosa versicolor virginica
  setosa         11          0         0
  versicolor      0          6         0
  virginica       0          0        15

We obtain an accuracy of 100%, as can clearly be concluded from the above confusion matrix: all 32 test observations lie on the diagonal.
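This can be verified directly from the table:

> sum(diag(tab))/sum(tab)   # (11+6+15)/32 = 1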

Conclusion:

• Statistical analysis is the first step to machine learning.

• Logistic regression bridges the gap by assuming that the dependent variable is a stochastic event and describing the outcome of that event with a density function (a function of cumulated probabilities ranging from 0 to 1).

• Multinomial Logistic Regression (MLR) is an extension of binary logistic regression used when the dependent variable is nominal with more than two levels. It is used to describe data and to explain the relationship between one dependent nominal variable and one or more continuous-level (interval or ratio scale) independent variables. It estimates a separate binary logistic regression model for each non-reference category; thus, with M categories, the result is M − 1 binary logistic regression models.
